Genomic sequencing method

ABSTRACT

A method of determining the nucleotide sequence of a DNA molecule of arbitrary length as a single procedure by sequencing portions of the molecule in a fashion such that the 5' end of the succeeding contiguous portion is sequenced as the 3' end of its preceding portion is sequenced, for all portions, where the order of contiguous portions is determined by the sequence of the DNA molecule. Sequencing of the individual portions is accomplished by generating a family of polynucleotides under conditions which determine that the elements are partial copies of the portion and are of random nucleotide length on the 3' and 5' ends about a dinucleotide which is an internal reference point; determining the base composition and terminal base identity of each element of the family; and solving for the sequence by a method of analysis wherein the base composition and terminal base data of each element are used to solve for a single base of the sequence by assigning the base to either the 5' or 3' end of the partial sequence about the internal reference point as the entire sequence of the portion is built up from a dinucleotide.

CROSS REFERENCE TO RELATED APPLICATIONS

This application is a Continuation-in-Part of my co-pending application Ser. No. 681,842 filed Dec. 14, 1981, now abandoned.

BACKGROUND OF THIS INVENTION

Deoxyribonucleic acid (DNA) is the primary genetic material. DNA is an informational molecule which encodes all of the proteins which make up a living organism. It serves this function in all living organisms.

DNA consists of two intertwined polynucleotide chains, the double helix. Each chain is a polymer made up of nucleotides. The nucleotide constituents consist of a sugar, a phosphate and a nitrogenous heterocyclic base. Interchain pairing between the bases, via hydrogen bonding, holds the two chains together in the helical coil.

The nucleotides of DNA contain the sugar 2-deoxyribose and are designated deoxyribonucleotides. Nucleotides which contain the sugar D-ribose are called ribonucleotides: these are the building blocks of ribonucleic acid (RNA), the intermediary transcript of DNA which serves as the actual template for protein synthesis.

In all nucleotides the sugar moiety is attached to the nitrogenous base via the glycosidic carbon (1' carbon of the ribose). This combination of sugar and base is called a nucleoside. Phosphorylation of a nucleoside at the 5' carbon of the sugar gives a nucleotide. The backbone of the DNA polymer is formed of phosphodiester bonds between 2'-deoxynucleoside 5'-monophosphates (3',5'-phosphodiester bridges).

The nucleotides of DNA differ only in the nitrogenous base. There are two types of nitrogenous bases in nucleotides, pyrimidines and purines. The pyrimidines are uracil, thymine and cytosine (abbreviated U, T, and C respectively). Nucleotides containing uracil are found primarily in RNA whereas thymine is found in DNA. The major purines are adenine and guanine (abbreviated A and G respectively). In the DNA helix, complementary DNA chains are held together by base pairing. The sugar-phosphate backbones are on the outside of the DNA molecule and the purine and pyrimidine bases on the inside. Adenine (A) can pair only with thymine (T) while guanine (G) can pair only with cytosine (C).

The genetic information of DNA is stored in the linear sequence of the four nucleotides. Most nucleotides along a strand of DNA make up genes which code for specific polypeptides. The nucleotide sequence of DNA is read in groups of three nucleotides. Each triplet is a "code-word", or codon, for an amino acid. As there are 4 different nucleotides in DNA (distinguished by the bases A, T, G, and C) there are 4³ = 64 different codons. These codons comprise the entire genetic code. Most of the codons designate an amino acid; some serve as start and stop signals for protein translation. The genetic code is degenerate because there is more than one codon for most of the amino acids. For example, the amino acid alanine is coded for by the DNA template-strand triplets CGA, CGG, CGT and CGC. (In RNA, the complementary triplets are GCU, GCC, GCA and GCG.)
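The codon count and the alanine example can be checked with a few lines of Python. This is an illustrative sketch only; the names `BASES`, `PAIR`, `all_codons` and `template_to_rna` are introduced here for the example, and the DNA triplets are treated as template-strand triplets paired base by base with the RNA codon.

```python
from itertools import product

BASES = "ATGC"  # the four DNA bases

def all_codons():
    """Enumerate every triplet over the four DNA bases."""
    return ["".join(p) for p in product(BASES, repeat=3)]

# Base-by-base pairing of a DNA template triplet with its RNA codon
# (assumption for this sketch: no strand-direction reversal is applied).
PAIR = {"A": "U", "T": "A", "G": "C", "C": "G"}

def template_to_rna(triplet):
    """Return the RNA triplet that pairs with a DNA template triplet."""
    return "".join(PAIR[b] for b in triplet)

if __name__ == "__main__":
    print(len(all_codons()))
    print([template_to_rna(t) for t in ("CGA", "CGG", "CGT", "CGC")])
```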

Several techniques have been developed for determining the nucleotide sequence of DNA. Among the more widely practiced are the methods of Maxam and Gilbert and of Sanger.

In the DNA sequencing technique of Maxam and Gilbert, a segment of DNA is labeled at one end with radiolabeled phosphate. The labeled DNA is divided into four samples and each sample is treated with a chemical that specifically destroys one or two of the four bases in the DNA. The "nicked" molecules are then treated with piperidine which breaks the DNA backbone at the site where the base has been destroyed. This generates a series of labeled fragments the lengths of which depend on the distance of the destroyed base from the labeled end of the segment. The labeled polynucleotides are separated according to size on an acrylamide gel. The gel is autoradiographed and the patterns of bands on the X-ray film determine which base was destroyed to produce each radioactive fragment. From this information the position of the destroyed bases can be determined and the overall sequence of the DNA deduced.

The DNA-sequencing technique of Sanger is an enzymatic procedure which entails the synthesis of radiolabeled DNA polynucleotides from the DNA strand to be sequenced. Chain-terminating dideoxynucleoside triphosphates are used to stop synthesis at a particular nucleotide. A Sanger sequencing reaction includes a DNA strand to be sequenced, a labeled primer complementary to the end of that strand, a carefully controlled ratio of one particular dideoxynucleotide with its normal deoxynucleotide, and the other 3 deoxynucleotides. When DNA polymerase is added, normal polymerization begins from the primer; when a dideoxynucleotide is incorporated, the chain is terminated. This results in a series of labeled polynucleotides whose lengths depend upon the location of a particular base relative to the end of the DNA strand.

Four separate polymerase reactions are conducted, each containing one type of dideoxynucleotide. Radiolabeled fragments are separated by size on an acrylamide gel. The pattern of polynucleotides gives the DNA sequence.
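How the band pattern of four chain-termination reactions yields a sequence can be illustrated with an idealized Python sketch. It assumes that every possible termination product is present and perfectly resolved; the function names `sanger_fragments` and `read_gel` are hypothetical and exist only for this illustration.

```python
def sanger_fragments(synthesized):
    """For each dideoxy reaction, list the lengths of the terminated chains.

    `synthesized` is the sequence of the newly made strand, 5' to 3'.
    A chain of length i+1 appears in the lane of the base at position i.
    """
    lanes = {base: [] for base in "ACGT"}
    for i, base in enumerate(synthesized):
        lanes[base].append(i + 1)
    return lanes

def read_gel(lanes):
    """Read bands from shortest to longest; the lane identifies the base."""
    bands = sorted(
        (length, base) for base, lengths in lanes.items() for length in lengths
    )
    return "".join(base for _, base in bands)

if __name__ == "__main__":
    lanes = sanger_fragments("GATTACA")
    print(lanes)
    print(read_gel(lanes))
```

Reading the sorted bands reproduces the input strand, which is the sense in which the pattern "gives" the sequence.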

The methods of Sanger and Maxam and Gilbert are convenient ways of sequencing single fragments of DNA of about 400 base pairs or less in length. To sequence large pieces of DNA, overlapping fragments of suitable length must be generated and sequenced individually. The overlapping sequence information among fragments provides the sequential relationship of the fragments so that their relative order can be assigned. From this information, the entire sequence of the parent piece of DNA can be pieced together.

With these approaches it is apparent that as the length of the DNA strand to be sequenced increases, the probability of obtaining fragments of a DNA strand which overlap sufficiently to eliminate the random occurrence of corresponding sequence decreases. Consequently, the number of randomly generated fragments of the strand necessary for accurate sequencing increases. For DNA strands even two orders of magnitude shorter than a human chromosome, the number of randomly generated fragments required is immense. Thus, the technique is impractical for sequencing molecules of this size.

The Sanger technique of sequencing long strands of DNA requires DNA cloning procedures because DNA polymerization requires a primer. This requirement is met by cloning the fragment into a vector so that it is contiguous with a region of known sequence, for which a complementary primer can be provided. Because of this dependency on cloning technology, the procedure is difficult to automate.

DISCLOSURE OF THE INVENTION

This invention pertains to a method of sequencing DNA which entails the generation of a specific set of polynucleotides from the DNA to be sequenced, determining the nucleic acid base composition and terminal base of these molecules, and solving the base sequence of the molecules, and thus the sequence of the DNA, by an algorithm called the matrix method of analysis.

The molecules generated from the DNA to be sequenced comprise families of polynucleotides. Each family corresponds to a segment of the DNA to be sequenced and is made up of a longest polynucleotide (the length of which is selected to be within the analyzable limit of the procedure used to determine base composition and identity of the terminal base) and shorter polynucleotides which form a "sequential subset" of the longest polynucleotide. Grouped hierarchically from the longest to the shortest polynucleotide, each polynucleotide of the family is progressively one nucleotide shorter than the preceding polynucleotide and has the same sequence except that it lacks the one nucleotide. A further constraint on the elements of the family is that there is a specific dinucleotide of the sequence contained in each element. The molecules can be envisioned as being built around an "axis" which is at the mid position of the common dinucleotide. The "axis" constitutes an internal reference point. The polynucleotides vary around the "axis", each containing one less nucleotide on the 3' or 5' end than its longer predecessor in the group. All such molecules are included in the family, from the longest to the shortest, a dinucleotide.
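The structure of such a family can be made concrete with a short Python sketch. It is an illustration under stated assumptions, not the laboratory procedure: the function name `family` and the convention that `axis` counts the nucleotides lying 5' to the reference point are introduced here for the example.

```python
def family(longest, axis):
    """All sub-polynucleotides of `longest` that span the axis.

    `axis` is the number of nucleotides 5' to the reference point, so the
    common dinucleotide is longest[axis - 1 : axis + 1].
    """
    if not 1 <= axis <= len(longest) - 1:
        raise ValueError("axis must fall between two nucleotides")
    members = [
        longest[start:end]
        for start in range(axis)                      # 5' end varies
        for end in range(axis + 1, len(longest) + 1)  # 3' end varies
    ]
    # Group from the longest polynucleotide down to the dinucleotide.
    return sorted(members, key=len, reverse=True)

if __name__ == "__main__":
    for member in family("GATTACA", 3):
        print(member)
```

For a longest polynucleotide with k nucleotides on the 5' side of the axis and x on the 3' side, the sketch yields k times x members, every one of which contains the common dinucleotide.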

In the preferred embodiment of the invention, the polynucleotides are RNA/DNA hybrid molecules generated from the DNA to be sequenced. To form these hybrids, DNA to be sequenced is broken into fragments and each fragment is used as a template to form one or more RNA transcript(s). The RNA transcript(s) is then extended on the original intact DNA template with deoxyribonucleotides to form the DNA portion of the hybrid(s). The extension is terminated randomly by addition of dideoxynucleotides to the polymerase reaction. This yields RNA/DNA hybrid molecules which are "random" in length at the 3' end. The molecules can then be randomized at the 5' (RNA) end, preferably by using an RNA exonuclease which, under appropriate conditions, degrades the 5' RNA portion. The result of this procedure is a family of polynucleotides having the characteristics set forth above. The "axis" referred to above is the dividing line between the RNA and DNA and it immediately follows the 3' most ribonucleotide of all the hybrid molecules.

The sequence of the DNA portion from which each family of polynucleotides has been made can be solved by determining the base composition (the number of A's, T's, C's and G's) and the identity of the 3' terminal base of each polynucleotide of the family. A preferred method of analyzing the polynucleotides to obtain these qualitative and quantitative data is mass spectroscopic analysis.

In the preferred embodiment the fifth position of the pentose of the nucleotides of the polynucleotide is mass labeled with isotopes of carbon, hydrogen, and oxygen in such a fashion that each possible nucleotide and terminal nucleotide releases a distinct mass labeled molecule (such as formaldehyde) from the fifth position of the pentose when the polynucleotide is degraded to 3' nucleotides or nucleosides and reacted with periodic acid. The relative abundance of the liberated molecules of different masses corresponding to different bases is recorded using a mass spectrometer. The intensities of the signals which correspond to the different bases are normalized with the signal corresponding to the terminal base, which serves as the internal standard with a signal of unity. The base composition and terminal base identity are given by the normalized data.
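The normalization step can be sketched in Python as follows. The sketch rests on assumptions that are not stated in the text: that signal intensity is proportional to the number of nucleotides releasing each labeled molecule, that exactly one terminal signal is present, and that the signals for internal nucleotides exclude the terminal one. The label convention (`"A"` for internal adenine, `"A*"` for terminal adenine) and the function name `normalize` are hypothetical.

```python
def normalize(signals):
    """Convert raw intensities to a base composition and a terminal base.

    `signals` maps labels to intensities: 'A', 'C', 'G', 'T' for internal
    nucleotides and 'A*', 'C*', 'G*', 'T*' for the 3' terminal nucleotide.
    """
    terminal = [k for k, v in signals.items() if k.endswith("*") and v > 0]
    if len(terminal) != 1:
        raise ValueError("expected exactly one terminal-base signal")
    unit = signals[terminal[0]]          # internal standard, defined as unity
    terminal_base = terminal[0][0]
    composition = {b: round(signals.get(b, 0.0) / unit) for b in "ACGT"}
    composition[terminal_base] += 1      # count the terminal nucleotide itself
    return composition, terminal_base

if __name__ == "__main__":
    raw = {"A": 6.1, "C": 1.9, "G": 2.0, "T": 4.1, "G*": 2.0}
    print(normalize(raw))
```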

The composition and terminal nucleotide data of the elements of each family of polynucleotides are used to solve the sequence of the corresponding DNA portion template by a method of first generating all polynucleotides which can be obtained from a guessed solution of the sequence by successive removal of a 3' or 5' nucleotide consistent with the data of the change in composition between set elements and with the further constraint that a specific dinucleotide of the sequence must be present in all polynucleotides. The terminal nucleotide data is used to determine if a subset of the hypothetical family of polynucleotides exists such that the elements have a one to one correspondence with the data of terminal nucleotide as well as composition. If no such subset exists, the process is repeated for improved guesses until convergence to the correct solution for the sequence occurs.
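The guess-and-verify idea can be illustrated with a brute-force Python sketch. This is a deliberately naive stand-in, not the matrix method of analysis: it tries every ordering of the bases of the longest polynucleotide and every axis position, and keeps the guesses whose predicted composition and terminal-base data match the observations. The function names are hypothetical, the data are treated as a multiset, and more than one guess may survive for a given data set.

```python
from collections import Counter
from itertools import permutations

def signature(seq):
    """Observable data for one polynucleotide: composition and 3' terminal base."""
    return (tuple(sorted(Counter(seq).items())), seq[-1])

def family_data(seq, axis):
    """Multiset of signatures for every member that spans the axis."""
    return Counter(
        signature(seq[start:end])
        for start in range(axis)
        for end in range(axis + 1, len(seq) + 1)
    )

def solve(observed, longest_composition):
    """Return every (sequence, axis) guess whose predicted data match `observed`."""
    pool = "".join(b * n for b, n in sorted(longest_composition.items()))
    matches = set()
    for perm in set(permutations(pool)):
        guess = "".join(perm)
        for axis in range(1, len(guess)):
            if family_data(guess, axis) == observed:
                matches.add((guess, axis))
    return sorted(matches)

if __name__ == "__main__":
    data = family_data("GATTC", 2)          # simulated measurements
    print(solve(data, Counter("GATTC")))
```

The exhaustive search grows factorially with length, which is why a directed procedure is needed for polynucleotides of practical size.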

An algorithm which performs this analysis by testing for the validity of a guess for part of the sequence while solving for the remaining part, using the composition and terminal base data independently to execute binary hypothesis testing decisions compatible with computer logic, is the matrix method of analysis algorithm.

The matrix method of analysis is analogous to solving a system of n equations in n unknowns where the knowns are: 1) the structural properties of the polynucleotides, 2) the base composition and the identity of the terminal base, 3) the change in composition and change in terminal base between a polynucleotide and the next in the family. The 5' half of the sequence may or may not be known. A different form of the matrix method is used depending on whether the sequence of the 5' half of the polynucleotides is known extrinsically. The method exploits the given information by implementing a reiterative procedure to find a path through a matrix of the possible polynucleotides having sequences consistent with the data. Final assignment of the sequence is made when the entire path finding procedure can be accomplished without contradictions between sequence assignment and actual data.

A major feature of the sequencing procedure of this invention is that it eliminates the obstacle of overlap among sequenced fragments of DNA occurring by chance. Because of the manner in which the polynucleotides are generated, each DNA fragment sequenced can be placed in proper order in relationship to all other fragments because a sufficient portion of the 5' end of the flanking DNA fragment is elucidated as the sequence of any DNA fragment is determined. This permits the proper ordering of sequenced DNA fragments of a large DNA molecule to yield the entire sequence of DNA. Thus, an entire gene or even an entire chromosome can be sequenced from one set of restriction enzyme fragments of the gene or chromosome.

For sequencing fragments of DNA containing greater than 400 nucleotides, the methods of Maxam and Gilbert and Sanger rely on the chance overlap of restriction fragments which are isolated from a digest of the DNA to be sequenced. If small fragments are lost during the isolation procedure, then these molecules will not be sequenced; therefore, certain sequence information may be lost on account of the deletions. The strategy of the present invention, however, circumvents this shortcoming because the procedure to sequence any given restriction fragment solves for the sequence of the restriction fragment and sequences into the 5' region of the contiguous restriction fragment; any short fragments which are lost during isolation procedures are sequenced. Thus, no small deletions occur in the solution of the sequence by this method.

In addition, both strands of the template can be sequenced simultaneously; the sequence for any one strand can be verified by the sequence obtained for the antiparallel strand.

Another important attribute of the procedure is that it can be automated with instrumentation. Separation procedures can be automated with instruments such as automated electrophoresis and fraction collectors. A mass spectrometer with a response time of 10 msecs can scan 100 molecules in a second, and NMR equipment which can be used to scan NMR labeled molecules has the capacity to scan 80,000 samples in less than one second. The sequence can be solved from the data obtained by using the matrix method of analysis programmed into a high speed computer. Thus, this method constitutes a procedure which is automatable to sequence DNA rapidly on a large scale.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a schematic illustration of the Gaussian distribution of polynucleotide reaction products.

FIG. 2 is a flow chart of the steps of the Best Mode of Carrying Out the Invention.

FIG. 3 is a flow chart of the steps of Procedures IV and V of Alternative Modes of Carrying Out the Invention.

FIG. 4 is an example of a configuration matrix used to solve a sequence by the Matrix Method of Analysis I.

FIG. 5 is a lattice used to solve the first rectangular matrix of FIG. 4.

FIG. 6 is a lattice used to solve the fourth rectangular matrix of FIG. 4.

FIG. 7A is an example of NMR data from fragment #15 of FIG. 4.

FIG. 7B is an example of mass spectroscopic data from fragment #15 of FIG. 4 where the data was obtained of labeled CO₂ using scheme 1 as described in the Determination of the Base Composition and Terminal Base Identity by Mass Spectrometry Section.

FIG. 8 is an automation scheme which parallels the flow chart of FIG. 2.

FIG. 9 is a diagram of the electro-optical ion detector array mass spectrometer.

FIG. 10A is a diagram of the electrophoresis apparatus.

FIG. 10B is a cross section of a tube gel, manifold, and glass bead well.

FIG. 10C is a diagram that illustrates the collection of the glass beads in receiving vessels.

FIG. 10D is a diagram of the concentric disks containing wells of activated glass beads which rotate independently to collect electrophoresed nucleic acid.

FIG. 10E is a diagram of a single unit of the electrophoresis apparatus where DNA is collected from the units through outlets.

PRINCIPLES BEHIND STRATEGY OF THE METHOD OF THE INVENTION

I. Background

Two major techniques have been developed to sequence DNA: a chemical one, the Maxam and Gilbert technique, and an enzymatic one, the Sanger technique. The principle of the former involves labeling the 5'-hydroxyl of a DNA strand with ³²P and establishing chemical conditions that cause the DNA strand to break at the site of occurrence of one specific nucleotide, either A, T, G, or C. The net effect is to produce polynucleotides of differing lengths, and the length of these polynucleotides reflects the occurrence of a base at a specific position in a DNA strand.

The principle of the latter technique involves copying a DNA template from a predetermined position on the template and terminating transcription where the terminating nucleotide is known from the specific reagent used during transcription. DNA polymerase I, a 5' to 3' DNA polymerase which requires a free 3' OH, is used with a primer to synthesize a radioactive single strand copy of a template DNA strand whose sequence is to be determined. To each of four reaction mixtures, a blocked derivative of one of the four different nucleotide substrates is added. These reagents are 2',3'-dideoxy analogues of each nucleoside triphosphate; they lack a 3' hydroxyl group. They block further extension of any chain into which they are incorporated. As a result, a series of polynucleotides is formed whose length reflects the position in the polynucleotide of a base corresponding to the analogues.

In both techniques, the reaction products are analyzed by separating them according to length by electrophoresis on a polyacrylamide gel and exposing the gel to x-ray film. The sequence can be read directly from bottom to top of the film. Advancing from one position of radiopacity to the next represents the addition of one nucleotide, and the identity of that nucleotide is assigned from the nature of the reaction that produced the radiopaque band.

These popular methods of DNA sequencing rely on the formation of a set of molecules all of which have a common property, a reference point at the 5' end of the molecule. Each member of the entire set has the following properties: they are superimposable on a portion of the parent molecule of the set of unknown sequence and the nature of the superimposition is such that all of the molecules can be superimposed on the parent when the 5' end of any molecule of the set and the 5' end of the parent molecule are aligned. The molecules make up a subset of the parent where subset denotes the relationship of identical sequence to some portion of the parent molecule and a total number of nucleotides less than or equal to that of the parent while maintaining a common 5' end. The entire subset refers to a group of molecules differing in length by one nucleotide, ranging in length from one nucleotide to the number of nucleotides of the parent.

Molecules which differ in length by one nucleotide can be separated on a polyacrylamide gel. For any molecule up to about 400 base pairs in length, a molecule that is identical except for possessing an additional nucleotide at the 3' end will have a slower electrophoresis velocity. Thus, the smaller polynucleotide will migrate closer to the anode than the longer. Thus, if the last nucleotide of the molecule at the 3' end is known and all molecules have the same 5' end and are superimposable, the sequence can be determined from the knowledge of the identity of the last nucleotide. This is the basis of sequencing by the Maxam and Gilbert and Sanger techniques.

Another strategy can be envisioned where the set of molecules are of length less than or equal to the parent molecule, are superimposable beginning from the 3' end of the parent and contain elements of length from one nucleotide to the number of the parent. In this strategy the common reference point is the 3' end. Utilizing the capability of separating molecules differing by a single nucleotide and the knowledge of the last nucleotide on the 5' end, this procedure could serve to assign the identity of any 5' nucleotide as the length of the polynucleotide increases by one. The sequence of the parent can be assigned by reiteration of this procedure.

Catalyzed chemical reactions are used to generate the set of molecules in the case of the Sanger technique and chemical cleavage methods are employed to generate the polynucleotide set in the Maxam and Gilbert technique. Chemical or enzymatic reactions of these types have not been developed for the strategy where all the elements of the polynucleotide set have a common 3' end. However, because the approach is essentially the same, except that the opposite end of the parent molecule is chosen as the reference point, any limitations of the Maxam and Gilbert and Sanger techniques would probably apply to this strategy.

All of the described techniques are inadequate for sequencing DNA molecules of length greater than about 400 nucleotides due to inherent shortcomings. To sequence DNA of 10⁵ nucleotides, for example, by any of the above strategies, a set of molecules would have to be generated which contains all set elements that differ in length by one nucleotide from length one to 10⁵, and the molecules would have to be separable on a polyacrylamide gel. Both requirements result in failure of the above mentioned methods.

Chemical reactions often yield a Gaussian distribution of reaction products. For the Maxam and Gilbert and Sanger techniques, set generation is dependent on chemical reactions for which the yield as a function of length is approximately a Gaussian distribution, and for both methods, the average length is about 200 nucleotides.

In addition, molecules migrate at a velocity which is inversely proportional to their molecular weight. As the molecular weight increases, the difference in migration velocity decreases. Thus, molecules that differ by one nucleotide and range in length from one nucleotide to a length greater than 400 nucleotides cannot be reliably separated by polyacrylamide electrophoresis to produce a band pattern from which the sequence can be determined. For these reasons, these methods are not appropriate for sequencing DNA molecules greater than 400 nucleotides in length.
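The loss of resolution with length can be quantified with a small Python sketch. It takes the inverse proportionality stated above at face value and adds one assumption not in the text: that molecular weight is proportional to the number of nucleotides. The function name `relative_gap` is hypothetical.

```python
def relative_gap(n):
    """Fractional drop in velocity between chains of n and n + 1 nucleotides.

    With velocity proportional to 1/n, (v_n - v_(n+1)) / v_n = 1 / (n + 1).
    """
    return 1.0 - n / (n + 1)

if __name__ == "__main__":
    for n in (1, 10, 100, 400):
        print(n, relative_gap(n))
```

Under these assumptions the gap between neighboring bands shrinks from one half at n = 1 to 1/401, or roughly a quarter of one percent, at n = 400.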

To overcome these limitations, an approach of making overlapping fragments of the parent molecule has been widely implemented. The most common method of generating the fragments has been by cleavage with restriction enzymes. These fragments are individually sequenced by the methods of Maxam and Gilbert or Sanger and the total sequence is put together by overlap of the sequences of the fragments. However, this strategy also fails for sufficiently large molecules because of inherent limitations.

With present technology, a single fragment of DNA which contains less than 400 base pairs can be sequenced using the Sanger method of primed synthesis or the method of sequencing with base specific chemical cleavages of Maxam and Gilbert. These methods are described with other less predominant methods in Methods in Enzymology, vol. 65, 1980, pp. 497-701. Both methods are convenient for single fragments of DNA; however, to sequence a very large piece of DNA, both methods rely on overlap of fragments of the larger piece which are independently sequenced. Sets of fragments are generated to produce overlapping sequence information, where the relationship of one set of fragments to another is unknown; any overlap of fragments from different sets is due to chance. With this approach it is apparent that as the length of the DNA fragment to be sequenced approaches even a size two orders of magnitude smaller than a human chromosome, the probability of obtaining sets of fragments whose sequences overlap each other sufficiently to rule out random overlap (so that the relative order can be assigned and the sequence of the entire parent molecule thereby determined) goes to zero; thus, the number of different randomly generated sets of differing fragments that must be sequenced individually goes to infinity.

Furthermore, the popular method of primed synthesis depends on DNA cloning. DNA polymerase I extends primers from a 3' OH where the primer is bound to the complementary region at the 3' end of the template molecule, which is the complement of the molecule which is sequenced. One of the difficulties with the Sanger technique is that suitable primers must be generated which will bind to only the 3' end of the template, which is of unknown sequence. This problem can be circumvented by cloning the template into a region of a vector such that both its 5' end, in one orientation, and its 3' end, in another orientation, are contiguous to a region of the cloning vector of known sequence. Transcription is initiated from a primer which is complementary to this region.

Methods using cloning strategy are set forth in Methods in Enzymology, vol. 101, 1983, pp. 3-122. The strategy of generating random sets of fragments that are individually sequenced until enough sets have been sequenced to provide overlap to assign accurately the relative order of the fragments is discussed in New M13 Vectors for Cloning--Shotgun Cloning, Methods in Enzymology, supra, pp. 43-47. Sequencing procedures which depend on cloning as an integral part of the procedure are very difficult if not impossible to automate.

II. Strategy of the Sequencing Method of the Invention

The method of this invention is a readily automatable method which circumvents the problems inherent in the Maxam and Gilbert and Sanger techniques. The strategy is to create a group of molecules which contain a reference point which is internal. Initially, the location of the reference point is unknown, but it exists in all of the molecules. The molecules are a family of polynucleotides comprising complementary copies of a portion of the parent molecule from which they are generated and are superimposable on the parent by alignment of this internal point of reference. The location of the point of reference, or "axis," and the sequence of the parent molecule are solved for simultaneously by an algorithm called the matrix method of analysis.

The family of polynucleotides can be thought of as being all molecules which result from the sequential loss of nucleotides from the 5' and 3' end of the longest polynucleotide of the group. An ordered pattern of terminal nucleotide change and nucleotide compositional change occurs between members of sequential subsets. This algorithm exploits the pattern of ordered or systematic nucleotide compositional change and terminal nucleotide change that a designated longest polynucleotide with a given internal reference point and given nucleotide loss constraints can produce.

III. Criteria of Polynucleotides

The nucleotide sequence of a DNA strand can be solved by generating a family of polynucleotides overlapping portions of the DNA to be sequenced. Each family of polynucleotides forms a "sequential subset" of the longest polynucleotide of the group. A molecule that is identical to a given molecule less one nucleotide from either the 5' or 3' end is defined as a sequential subset of that molecule. A family of molecules with this intra-member relationship is defined as a proper sequential subset. Also, if all such families are considered collectively, then another type of sequential subset can be defined. A family of molecules can be selected from this group which are "sequential subsets" in the sense that any molecule of the family contains the exact same base composition less one nucleotide relative to another molecule of the family. The former is defined as an improper sequential subset of the latter if it does not contain the exact same sequence as the latter less one nucleotide from the 3' or 5' end. The family is defined as "improper sequential subsets" and only one polynucleotide of a given length is present in the family.
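The two definitions can be expressed as predicates on a pair of sequences. The following Python sketch is one reading of the definitions above; the function names `is_proper` and `is_improper` are hypothetical.

```python
from collections import Counter

def is_proper(shorter, longer):
    """True if `shorter` is `longer` less one nucleotide from the 5' or 3' end."""
    return len(shorter) + 1 == len(longer) and shorter in (longer[1:], longer[:-1])

def is_improper(shorter, longer):
    """True if `shorter` has the composition of `longer` less one nucleotide
    but is not obtained by removing a terminal nucleotide from `longer`."""
    if len(shorter) + 1 != len(longer):
        return False
    missing = Counter(longer) - Counter(shorter)   # bases longer has in excess
    extra = Counter(shorter) - Counter(longer)     # bases shorter has in excess
    composition_less_one = sum(missing.values()) == 1 and not extra
    return composition_less_one and not is_proper(shorter, longer)

if __name__ == "__main__":
    print(is_proper("ATTA", "GATTA"), is_improper("TATA", "GATTA"))
```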

The molecules can be depicted as follows:

    K_(n') . . . K₄ K₃ K₂ K₁ X₁ X₂ X₃ X₄ . . . X_(n)

where the series K₁, K₂, K₃, K₄ . . . K_(n') represents the nucleotides of the polynucleotide 5' to the internal reference point, or axis, and the series X₁, X₂, X₃, X₄ . . . X_(n) represents the nucleotides of the polynucleotide on the 3' side of the axis. The 5' end with respect to the axis is designated as the "known" portion of the molecules (this does not necessarily imply that this sequence is initially known), and the 3' end of the polynucleotide is designated as the "unknown" portion. Thus, K₁, K₂, K₃, K₄ . . . represent the "known" sequence and X₁, X₂, X₃, X₄ . . . represent the "unknown" sequence. The distinction is that in the matrix, as described below, K₁, K₂, K₃, K₄ . . . appear as nucleotides, whereas the X's represent variables. The nucleotides of the "known" portion can be known extrinsically or they can be guessed.

There are at least three strategies of polynucleotide generation which yield a population of molecules from which sufficient data can be derived and analyzed by the proper form of the matrix method of analysis to solve for the sequence of the parent molecule uniquely.

A. Strategy I

According to Strategy I the polynucleotides are governed by the following constraints. No polynucleotide contains X₂ without containing X₁. In general terms, no polynucleotide contains X_(n) without containing X_(n-1), X_(n-2), . . . X₁. In addition, no polynucleotide contains K₂ without containing K₁. That is, no polynucleotide contains an unknown without containing all preceding unknowns, and every polynucleotide contains all succeeding knowns if it contains any given known. As a set, all the polynucleotides satisfy these criteria and vary randomly at the 3' and 5' ends.

The criteria can be represented symbolically as follows:

    X_(n) → X₁ (X_(n) implies X₁)

    K_(n') → K₁ (K_(n') implies K₁)

    . . . K_(n') - X_(n) . . . (The polynucleotides are random at the 5' and 3' ends; the knowns and unknowns are variables, where K = known, X = unknown, n' = 1 to 4 . . . and n = 1 to 4 . . .)
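The Strategy I criteria amount to requiring that the subscripts present on each side of the axis form an unbroken run starting at 1. A minimal Python sketch of that check follows; representing a polynucleotide by two sets of subscripts, and the function name `satisfies_strategy_one`, are assumptions made for illustration.

```python
def satisfies_strategy_one(knowns, unknowns):
    """Check one polynucleotide against the Strategy I criteria.

    `knowns` and `unknowns` are the sets of subscripts of the K and X
    nucleotides present in the polynucleotide.
    """
    def runs_from_one(indices):
        indices = set(indices)
        # X_n implies X_(n-1) ... X_1, and likewise for the knowns.
        return indices == set(range(1, len(indices) + 1))

    return runs_from_one(knowns) and runs_from_one(unknowns)

if __name__ == "__main__":
    print(satisfies_strategy_one({1, 2}, {1, 2, 3}))  # K2 K1 X1 X2 X3
    print(satisfies_strategy_one({2}, {1}))           # K2 without K1
```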

B. Strategy II

All polynucleotides of a family conform to the following criteria. As in Strategy I, no polynucleotide contains X₂ without containing X₁ and no polynucleotide contains X₃ without containing X₂ and X₁, and so on. In general terms, no polynucleotide contains X_(n) without containing X_(n-1), X_(n-2), . . . X₁. However, no restrictions are placed on the "known" nucleotides K₁, K₂, K₃, K₄ . . . As a set, all polynucleotides conform to these criteria and vary randomly at the 3' and 5' ends.

These criteria can be represented symbolically as follows:

X_(n) → X₁ (X_(n) implies X₁)

K_(n') ↛ K₁ (K_(n') does not imply K₁)

K_(n')-K_(n") (the knowns are 3' and 5' random; the knowns are variables, where the range of n' and n" is 1 to 4 . . .)

K_(n')-X_(n) (the polynucleotides are 5' and 3' random; the knowns and unknowns are variables, where K=known; X=unknown; n=1 to 4 . . . and n'=1 to 4 . . .)

C. Strategy III

All polynucleotides of each family must conform to the following criteria. No polynucleotide contains X₂ without containing X₁ and no polynucleotide contains X₃ without containing X₂ and X₁, and so on. In general terms, no polynucleotide contains X_(n) without containing X_(n-1), X_(n-2), . . . X₁. However, no restrictions are placed on the "known" nucleotides; that is, all possible combinations of knowns with unknowns are possible as long as the first rule is satisfied. Furthermore, a change in known nucleotides of any two polynucleotides of the same composition implies that they have different terminal nucleotides. All polynucleotides adhere to these criteria and vary randomly at the 3' and 5' ends.

These criteria can be represented symbolically as follows:

X_(n) → X₁ (X_(n) implies X₁)

K_(n') ↛ K₁ (K_(n') does not imply K₁)

For K-X_(T1) and K'-X_(T2): ΔK and ΔK' → X_(T1) ≠ X_(T2)

(Given two polynucleotides which begin with knowns and end with unknowns and have the same composition, a change in knowns for both implies that the terminals are not the same.)

K_(n')-X_(n) (the fragments are 5' and 3' random; the knowns and unknowns are variables, where K=known; X=unknown; n'=1 to 4 . . . and n=1 to 4 . . .)

An entire set of polynucleotides is defined as a family which satisfies all criteria of one of these three strategies. By determining the base composition and terminal base for sequential subsets, the sequence of the polynucleotides, and therefore the sequence of the original DNA strand (the parent), can be solved by the matrix method of analysis.

IV. Principles of Matrix Method of Analysis

The basis of the matrix method of analysis is that there exists in all molecules, as described, a reference point which is a nucleotide position relative to the parent. These molecules comprise a sequential subset of a polynucleotide copy of a segment of the parent DNA. Nucleotides are lost from the 5' or 3' end. All the set elements are superimposable on the parent about this point of reference. The change in composition and terminal nucleotide, and the order in which this change occurs, is unique moving from one set element to the next, from the longest to the shortest element of the set.

The matrix method of analysis is a method that exploits the information that is known from the design of the polynucleotide generating reactions and the data obtained from the polynucleotides. The information is as follows:

1) the composition and change in composition of a polynucleotide as a result of random loss of one nucleotide from either the 3' or 5' end at each step; 2) the constraints on the random 3' and 5' loss described under criteria of polynucleotides; 3) the identity of the terminal base and the change in terminal base at each step; and 4) the order in which these changes occur.
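
To make the four kinds of information concrete, the following minimal sketch derives them from a set of fragment sequences that are already ordered from longest to shortest. It is illustrative only: in the method itself the sequences are not known in advance, the function name `observations` is hypothetical, and the "terminal base" is assumed here to be the 3' terminal base.

```python
from collections import Counter


def observations(fragments):
    """fragments: sequences written 5'->3', ordered longest first, each one
    nucleotide shorter than its predecessor. Returns one row per fragment
    holding its composition, the base lost to reach it, and its 3' terminal."""
    rows = []
    previous = None
    for seq in fragments:
        composition = Counter(seq)
        lost = None
        if previous is not None:
            diff = previous - composition
            # Exactly one base must disappear and none may be gained.
            if sum(diff.values()) != 1 or (composition - previous):
                raise ValueError("each step must lose exactly one base")
            lost = next(iter(diff))
        rows.append({"composition": dict(composition),
                     "lost": lost,
                     "terminal": seq[-1]})
        previous = composition
    return rows


for row in observations(["ATTCGCTA", "TTCGCTA", "TTCGCT"]):
    print(row)
```

The order of the rows carries the fourth item of information, namely the order in which the changes occur.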

The matrix method of analysis entails setting up a rectangular matrix where the designated longest polynucleotide appears at position (1,1). The sequence of one half of this molecule is "known" and the nucleotide sequence of the other half of the molecule is designated "unknown" and represented by variables. The term "known" does not necessarily imply that the nucleotide sequence of the parent molecule is known initially. The division between the "knowns" and "unknowns" is the internal reference point. The location of the reference point is not necessarily known initially and can be changed by changing the knowns so that this sequence superimposes a different region of the parent molecule. That is, when the sequence is solved, it will be superimposable upon a region of the parent and the location of the internal reference point will be fixed. The location on the parent is at the line dividing the "knowns" and the "unknowns". If the 5' end of the sequence (and consequently the entire sequence) were superimposable on a different region of the parent, the location of the internal reference point would be different. Thus, the location of the internal reference point relative to the parent molecule is determined by the "knowns".

An exemplary matrix is shown below for polynucleotides which conform to the criteria set forth for Strategy I. For a designated longest polynucleotide which contains a total of eight (8) nucleotides, the matrix consists of 5 rows and 4 columns. ##STR1##

The matrix columns contain polynucleotides which have lost nucleotides at the 5' end; the rows are formed of polynucleotides which have lost from the 3' end. Nucleotides are lost from the 5' end down any column and lost from the 3' end across any row. The matrix is constructed such that all the constraints governing the polynucleotides are satisfied and all possible polynucleotides are recorded in the matrix according to the described format.

The determination of the sequence of the polynucleotides proceeds as follows: starting at position (1,1) in the matrix, the base which has been lost is determined by the difference in base composition between the longest polynucleotide and the next longest of the set. The change is consistent with a move to position (1,2) and/or (2,1) of the matrix. The step is repeated for each polynucleotide of the family. These moves are down a column and/or across the row from left to right. Moves down a column or across a row from left to right are designated from/to moves. The result can be recorded, e.g., in a "lattice" which contains all coordinate positions arranged in levels such that each successive level from top to bottom corresponds to all possible from/to moves, and each successive level from bottom to top corresponds to all possible to/from moves. A to/from move is a movement up a column and/or across a row from right to left.
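
The layout just described can be generated programmatically. The sketch below is a minimal illustration for Strategy I, assuming the knowns are supplied as a string written 5' to 3' and the unknowns are printed as X1, X2, and so on; the function name `build_matrix` is hypothetical.

```python
def build_matrix(knowns, n_unknowns):
    """Return {(row, col): polynucleotide} for a Strategy I matrix.

    Moving down a column removes one known from the 5' end; moving
    across a row removes one unknown from the 3' end."""
    matrix = {}
    for row in range(1, len(knowns) + 2):
        for col in range(1, n_unknowns + 1):
            kept_knowns = knowns[row - 1:]
            kept_unknowns = "".join(
                f"X{i}" for i in range(1, n_unknowns - col + 2))
            matrix[(row, col)] = kept_knowns + kept_unknowns
    return matrix


matrix = build_matrix("ATTC", 4)
for row in range(1, 6):
    print("  ".join(matrix[(row, col)].ljust(12) for col in range(1, 5)))
```

For an eight-nucleotide longest polynucleotide with four knowns this yields the 5 rows and 4 columns stated above.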

    ______________________________________
    General Lattice
    ##STR2##
    Polynucleotide     Lattice coordinate position
    ______________________________________
    K₄ K₃ K₂ K₁        1,5
    K₃ K₂ K₁           2,5
    K₂ K₁              3,5
    K₁                 4,5
    ______________________________________

For each step, the base which could have been lost from the 3' or 5' end is determined and the appropriate move to a position in the matrix is made. This establishes the appropriate path in the matrix, which can be designated by connecting the corresponding coordinates in the lattice. This procedure is repeated until all consistent from/to moves are recorded in the lattice. At least one path is formed from coordinate position (1,1) to a point of convergence, i.e., a coordinate position from which no further from/to moves can be made.

The next step is to determine which path is the correct path. This is accomplished by starting at a point of convergence and determining which to/from steps for all single or binary decisions are consistent with the terminal base data as moves are made back to position (1,1) from the point of convergence. Assignment of a base to the 3' or 5' end is made by a to/from move which does not contradict the change in base. For all to/from moves, if the path that is chosen from one coordinate to another corresponds to a move across a row from right to left, then the base is assigned to the 3' end, which is consistent with the move. That is, the base change determined from the data occurred at the 3' end; therefore the base lost is assigned to the 3' end. A contradiction arises if this assignment is inconsistent with terminal base data for the polynucleotide represented at the coordinate position or if the change in terminal base for this step is inconsistent with the data.

For all to/from moves, if the path that is chosen from one coordinate to another corresponds to a move up a column, then the base change for that step indicates which base to assign to the 5' end. A contradiction would arise if the next "known" up the column in the matrix is different from that indicated by the base change.

The sequence is solved when at least one path is found from (1,1) to a point of convergence by from/to moves and to the (1,1) position from the point of convergence by to/from moves at each data step without contradictions.
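
The recording of from/to moves can be sketched as follows. This is a simplified illustration, not a complete implementation: it assumes Strategy I, assumes the knowns are given, and restricts the lattice to matrix positions that still contain X₁ (the known-only coordinates of the general lattice are omitted). The function name `forward_lattice` is hypothetical.

```python
def forward_lattice(knowns, n_unknowns, lost_bases):
    """Return the lattice as a list of levels; level k holds every matrix
    position (row, col) reachable after k from/to moves that is consistent
    with the sequence of lost bases."""
    levels = [{(1, 1)}]
    for base in lost_bases:
        nxt = set()
        for row, col in levels[-1]:
            # Down a column: the known at the 5' end must equal the lost base.
            if row <= len(knowns) and knowns[row - 1] == base:
                nxt.add((row + 1, col))
            # Across a row: the 3' base is a variable, so any lost base fits.
            if col < n_unknowns:
                nxt.add((row, col + 1))
        levels.append(nxt)
    return levels


# Lost bases taken from the Δ column of Example 1 (knowns ATTC).
for level in forward_lattice("ATTC", 4, list("AATTCTC")):
    print(sorted(level))
```

Each printed line corresponds to one level of the lattice; the last level is the point of convergence.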
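
The two passes described above (from/to moves using composition changes, then to/from moves using terminal base data) can be combined in a short sketch. It is offered under stated assumptions rather than as a definitive implementation: Strategy I, extrinsically supplied knowns, as many unknowns as knowns, 3' terminal base data, and a lattice restricted to positions that retain X₁. The function name `solve` and the data layout are hypothetical.

```python
from collections import Counter


def solve(knowns, compositions, terminals):
    """Return (paths, sequences) that are free of contradictions.

    knowns       -- 5' half of the longest polynucleotide, written 5'->3'
    compositions -- one {base: count} dict per polynucleotide, longest first
    terminals    -- 3' terminal base of each polynucleotide, same order
    """
    n = len(knowns)  # assumption: as many unknowns as knowns
    # Base lost at each step, from the change in composition.
    lost = [next(iter(Counter(a) - Counter(b)))
            for a, b in zip(compositions, compositions[1:])]

    # from/to moves: every position consistent with the lost bases.
    levels = [{(1, 1)}]
    for base in lost:
        nxt = set()
        for r, c in levels[-1]:
            if r <= n and knowns[r - 1] == base:   # 5' loss, down a column
                nxt.add((r + 1, c))
            if c < n:                              # 3' loss, across a row
                nxt.add((r, c + 1))
        levels.append(nxt)

    paths = []

    def back(step, path):
        # to/from moves: walk back toward (1,1), rejecting contradictions.
        if step == 0:
            paths.append(path)
            return
        r, c = path[0]
        base = lost[step - 1]
        up, left = (r - 1, c), (r, c - 1)
        # Up a column: lost base is the next known; 3' terminal unchanged.
        if (up in levels[step - 1] and knowns[r - 2] == base
                and terminals[step - 1] == terminals[step]):
            back(step - 1, [up] + path)
        # Right to left: the lost base was the previous 3' terminal.
        if left in levels[step - 1] and terminals[step - 1] == base:
            back(step - 1, [left] + path)

    for end in levels[-1]:
        back(len(lost), [end])

    sequences = set()
    for path in paths:
        unknowns = [None] * n
        for (r, c), t in zip(path, terminals):
            unknowns[n - c] = t   # 3' terminal in column c is X_(n-c+1)
        if None not in unknowns:
            sequences.add(knowns + "".join(unknowns))
    return paths, sequences


# Composition and terminal base data of Example 1.
compositions = [
    {"T": 3, "C": 2, "G": 1, "A": 2}, {"T": 3, "C": 2, "G": 1, "A": 1},
    {"T": 3, "C": 2, "G": 1}, {"T": 2, "C": 2, "G": 1},
    {"T": 1, "C": 2, "G": 1}, {"T": 1, "C": 1, "G": 1},
    {"C": 1, "G": 1}, {"G": 1},
]
paths, sequences = solve("ATTC", compositions, list("AATCCGGG"))
print(paths)
print(sequences)
```

With the Example 1 data the sketch finds a single contradiction-free path, 1,1 2,1 2,2 2,3 3,3 3,4 4,4 5,4, and the sequence ATTCGCTA.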

The matrix method of analysis yields a unique solution for a matrix of all possible polynucleotides of size (M/2+1, M/2) that conform to the constraints for polynucleotides of Strategy I, or of size (M/2+1, M) for polynucleotides of Strategy II, for any set of data of M-1 polynucleotides that are successively one nucleotide less and are sequential subsets from M-1 nucleotides to a dinucleotide. (The longest polynucleotide is M nucleotides in length.)

The key to the matrix method of analysis is that there is convergence to at least one of the terminal possibilities (a point in the matrix at which no further from/to moves can be made). It may converge to more than one (e.g., if the sequence contained only A, or T, or C, or G bases, then it would converge to all possible termini of the matrix that yields the solution of the sequence). Once any terminus is determined to be correct, it can serve as an initiation point, that is, a point or coordinate position from which the initial to/from move is made. A terminus representing a single nucleotide or single variable in the matrix is correct if it is consistent with the data. The sequence can be deciphered by making decisions at branch points and by taking the return path that is determined to be correct by the data, i.e., the terminal base and the change in the terminal base at each step. If more than one path is correct, any one of the correct paths will yield the sequence.

For a given set of polynucleotides which terminate in the leftmost terminus (K₄ K₃ K₂ K₁) (see the general lattice above), if the data contains K₁, K₂, K₃, K₄, then this coordinate (1,5) cannot be excluded as an initiation point. But if the sequence K₄ K₃ K₂ K₁ is known, then this is an initiation point from which to initiate the path which gives the sequence.

In the special case where only proper subsets (i.e., polynucleotides which have the same sequence but differ by one nucleotide) were chosen and the 5' knowns are assigned from extrinsic information, if the sequence of the polynucleotide at the step at which a terminus is reached is not known but contains the same base composition as the terminus polynucleotide represented in the matrix, then this path can be determined to be correct if no other path exists or is consistent with the data. That is, it is proven correct by exclusion. Furthermore, if the data converges in this fashion, and also to one or more other possibilities, then a terminus containing no variables may be an initiation point, but it may also not be. The initiation point cannot be validated unless the sequence is known extrinsically at this point. In all cases, an initiation point is chosen that is consistent with the data; there will always be at least one. And, in the special case described, if it does not exist directly then it exists by exclusion.

In the case where polynucleotides conform to the rules of Strategy II, where all possible polynucleotides of the knowns may be present in the set to be analyzed, these polynucleotides would occupy the appropriate position in the matrix and the additional coordinates would become part of the lattice at the proper level. Thus, sequential subsets could result in convergence to one or more of these new coordinate positions, which would serve as initiation points. This is discussed in more detail under matrix method of analysis II. A special matrix and lattice is discussed under matrix method of analysis I.

For a population of molecules with special properties specified in Strategy III, a special form of the matrix method of analysis will yield a unique solution where information that was required to solve the sequence with data from a population of molecules generated with different constraints is not needed under these special conditions. In general, the "knowns" of any given parent molecule at position (1,1) can be defined by two different procedures which have implications as to the type of polynucleotides which can be generated.

The "knowns" can be assigned extrinsically. That is, a reiterative procedure can be implemented such that when the 3' half of the sequence of one set of polynucleotides is solved, the 5' half, the "knowns", for the adjoining group of polynucleotides is solved. When a method is used which does not solve for the next contiguous half, the 5' half ("knowns") can be guessed. As described above, in essence, this amounts to a guess of the position of the axis on the parent molecule. Alternatively, the axis in the case where the "knowns" are assigned extrinsically may be guessed independently, which conversely determines the "knowns."

Sequential subsets, as defined previously, can be polynucleotides of the exact same composition and sequence as the predecessor minus one nucleotide, i.e., a proper sequential subset, or each polynucleotide of the set could be a molecule that has the exact same composition minus one nucleotide, but a different sequence, an improper sequential subset. All of these variables have implications with regard to the nature of the set of generated molecules and the analysis of these molecules.

As will be described under Matrix Method of Analysis I and Matrix Method of Analysis II, implementation of the Matrix Method of Analysis requires that the solution for the position of the "axis" be guessed (Matrix Method of Analysis II) or requires that both the position of the "axis" and the "knowns" must be guessed (Matrix Method of Analysis I). Furthermore, a proper set of sequential subsets must be guessed for both methods. Also, a guess can be made for the position of the axis and the set of proper sequential subsets, with or without the composition of the 5' one half of the polynucleotide being known extrinsically, and an unambiguous solution can be found by the matrix method only if the correct guesses are put forth.

From data, the sequence is verified by the absence of contradictions; conversely, fictitious data can be proposed from guesses for the sequence other than the correct sequence, and the fictitious data may be consistent with the base composition and the terminal base of each true polynucleotide, but the matrix generated from the incorrect guess will not yield an unambiguous solution using the real data. The matrix method of analysis reconstructs the sequence in the exact order that the nucleotides are successively eliminated from the 3' and 5' ends. From other guesses a set of polynucleotides consistent with the data may be proposed, but a different order of elimination must be imposed, and the matrix method will only yield the correct order for the correct matrix, which is unique to each guess. Also, the four data properties are unique due to the way the polynucleotides were generated, and the combination of a unique data set and matrix yields contradictions unless all variables are guessed correctly. When that occurs, the sequence is unambiguously determined. If a contradiction is encountered, depending on the strategy of analysis and strategy of polynucleotide generation, one or more of the following procedures is followed: new sequential subsets are chosen, the position of the axis is shifted, or a new guess is made for the "knowns".

For example, if the 5' end of the parent is known initially and an improper sequential subset is chosen, then a contradiction will arise. The molecule has the same composition and reference point as a proper sequential subset, but when it is aligned with the reference point on the parent and superimposed, it is readily apparent that the 3' and 5' ends of the improper sequential subset occur at different positions along the parent than those of a proper sequential subset. The matrix converges on a path which is an ordered record of the nucleotides that were lost from the 3' and 5' ends relative to the axis; therefore, assigning an improper subset to a matrix position corresponding to a proper one gives rise to a contradiction in later analysis. The ordered loss from the 3' and 5' ends must be different relative to the reference point for this polynucleotide because it is a different polynucleotide, and this will be apparent in the analysis when a from/to step is forced in the matrix which commits the molecule to a path that represents a different loss pattern from that which actually occurred, due to the analysis constraint that any polynucleotide which contains X_(n) must contain all previous unknowns including X₁. The contradiction will be noted with one of the to/from moves. At this point a new set of sequential subsets is guessed. When all the variables are guessed correctly, no contradictions will arise and the sequence is unambiguously assigned.

EXAMPLES OF SOLVING SEQUENCES BY THE MATRIX METHOD OF ANALYSIS

To further illustrate the matrix method of determining sequence, examples of its application are given below. In each example a matrix for a polynucleotide family of eight nucleotides in length is shown. The lattice diagram shows all possible matrix from/to moves consistent with the change in composition data. The column labeled "path" represents the possible to/from moves in the matrix which are consistent with the terminal base data and the change in terminal base. The path which determines the solution to the sequence is read from bottom to top.

                                  EXAMPLE 1
    __________________________________________________________________________
    ##STR3##
        1               2             3           4
    1.  ATTCX₁X₂X₃X₄    ATTCX₁X₂X₃    ATTCX₁X₂    ATTCX₁
    2.  TTCX₁X₂X₃X₄     TTCX₁X₂X₃     TTCX₁X₂     TTCX₁
    3.  TCX₁X₂X₃X₄      TCX₁X₂X₃      TCX₁X₂      TCX₁
    4.  CX₁X₂X₃X₄       CX₁X₂X₃       CX₁X₂       CX₁
    5.  X₁X₂X₃X₄        X₁X₂X₃        X₁X₂        X₁

                Composition          Terminal
    Lattice     Data           Δ     Nucleotide    Path    Sequence
    __________________________________________________________________________
    ##STR4##    3T,2C,1G,2A          A             1,1     ATTCGCTA
                3T,2C,1G,1A    A     A             2,1     TTCGCTA
                3T,2C,1G       A     T             2,2     TTCGCT
                2T,2C,1G       T     C             2,3     TTCGC
                1T,2C,1G       T     C             3,3     TCGC
                1T,1C,1G       C     G             3,4     TCG
                1C,1G          T     G             4,4     CG
                1G             C     G             5,4     G
    __________________________________________________________________________

                                  EXAMPLE 2
    __________________________________________________________________________
    ##STR5##
        1               2             3           4
    1.  AGTCX₁X₂X₃X₄    AGTCX₁X₂X₃    AGTCX₁X₂    AGTCX₁
    2.  GTCX₁X₂X₃X₄     GTCX₁X₂X₃     GTCX₁X₂     GTCX₁
    3.  TCX₁X₂X₃X₄      TCX₁X₂X₃      TCX₁X₂      TCX₁
    4.  CX₁X₂X₃X₄       CX₁X₂X₃       CX₁X₂       CX₁
    5.  X₁X₂X₃X₄        X₁X₂X₃        X₁X₂        X₁

                Composition          Terminal
    Lattice     Data           Δ     Nucleotide    Path        Sequence
    __________________________________________________________________________
    ##STR6##    3T,2G,1C,2A          G             1,1  1,1    AGTCATTG  AGTCATTG
                3T,2G,1C,1A    A     G             2,1  2,1    GTCATTG   GTCATTG
                3T,1G,1C,1A    G     G             3,1  3,1    TCATTG    TCATTG
                3T,1C,1A       G     T             3,2  3,2    TCATT     TCATT
                2T,1C,1A       T     T             3,3  4,2    TCAT      CATT
                1T,1C,1A       T     T             4,3  4,3    CAT       CAT
                1C,1A          T     A             4,4  4,4    CA        CA
                1A             C     A             5,4  5,4    A         A
    __________________________________________________________________________
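
Example 2 lists two paths that differ at a single coordinate. As a hedged illustration of how a path and the terminal base data together fix the intermediate polynucleotides, the sketch below reads the unknowns off the terminal base recorded at each column of a path. It assumes four knowns, four unknowns and 3' terminal base data; the function name `polynucleotides_along_path` is hypothetical.

```python
def polynucleotides_along_path(knowns, path, terminals):
    """Rebuild the polynucleotide at each coordinate of a path.

    The 3' terminal base observed while the path is in column c is
    taken to be the unknown X_(n-c+1)."""
    n = len(knowns)
    unknowns = ["?"] * n
    for (row, col), terminal in zip(path, terminals):
        unknowns[n - col] = terminal
    return [knowns[row - 1:] + "".join(unknowns[:n - col + 1])
            for row, col in path]


terminals = list("GGGTTTAA")  # terminal nucleotide column of Example 2
path_a = [(1, 1), (2, 1), (3, 1), (3, 2), (3, 3), (4, 3), (4, 4), (5, 4)]
path_b = [(1, 1), (2, 1), (3, 1), (3, 2), (4, 2), (4, 3), (4, 4), (5, 4)]
print(polynucleotides_along_path("AGTC", path_a, terminals))
print(polynucleotides_along_path("AGTC", path_b, terminals))
```

Both paths give the same longest polynucleotide, which illustrates the statement that any one of the correct paths yields the sequence.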

V. Approaches to Overlapping Sequences

A. Preferred Approaches

At least two approaches exist for generating polynucleotides which conform to the criteria set forth above and provide overlapping portions of the DNA to be sequenced.

In one approach, where the polynucleotides conform to the criteria set forth under Strategy III, the DNA to be sequenced is digested into restriction fragments using the same procedures that are used in sequencing long polynucleotides by the conventional Maxam and Gilbert and Sanger techniques. However, there is a major departure from the strategy and methods of the conventional techniques from this point. From each DNA restriction fragment, one or more RNA transcripts are made. The transcripts are then extended using the original DNA as template. Extension on this template produces overlapping sequence data so that the sequence data from the single set of restriction fragments can be ordered appropriately. In other words, the RNA copy serves as a primer for extension into the region that the appropriate RNA copy made from the 5' contiguous restriction fragment would bind. Because all of the RNA transcripts are extended, overlap data is generated for all copies, and at the same time for all restriction fragments, because the information is complementary. Therefore, the entire parent molecule can be sequenced from one set of restriction fragments. The method is described in more detail below.

The second approach of producing overlapping polynucleotides is to use the principle that the graph of the average polymerization reaction product length as a function of time has a positive slope which can be determined experimentally. For any given reaction time, the concentrations of the polymerization reaction products of various lengths will be approximately Gaussian in distribution and centered about an average length. (See FIG. 1). Manipulation of reaction parameters can alter the average and standard deviation of the yield distribution of the polynucleotide lengths. Because polymerization proceeds in a 5' to 3' direction, the sequence can be solved successively by using a reiterative methodology where the reaction products of reaction n are related to or are used to determine the distribution of reaction products of reaction n+1, where n is the reaction number and a higher number indicates a distribution of products with a longer average length.

Various schemes of polynucleotide generation can be devised to produce polynucleotides described in Strategy I and II and which exploit the principles of the time dependence of the polymerization reaction and the principle that the products from reaction n can be used to determine the reaction product distribution of reaction n+1.
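
The time dependence described here can be pictured with a small numerical sketch. All parameters below are hypothetical: the text states only that the average length increases with time with an experimentally determined slope, so a linear growth `rate`, a fixed standard deviation `sd` and the function name `length_distribution` are assumptions made for illustration.

```python
import math


def length_distribution(time, rate, sd, max_length):
    """Relative yield of each product length 1..max_length after `time`,
    assuming a Gaussian centered on mean = rate * time (assumed linear)."""
    mean = rate * time
    weights = [math.exp(-0.5 * ((length - mean) / sd) ** 2)
               for length in range(1, max_length + 1)]
    total = sum(weights)
    return [w / total for w in weights]


# Two successive reactions, n and n+1, run for increasing times.
early = length_distribution(time=2.0, rate=10.0, sd=3.0, max_length=60)
late = length_distribution(time=3.0, rate=10.0, sd=3.0, max_length=60)
print(early.index(max(early)) + 1, late.index(max(late)) + 1)
```

Choosing reaction times so that neighbouring distributions overlap is what allows the products of reaction n to be related to those of reaction n+1.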

In its simplest form, polymerization reactions can be run for increasing periods of time with mass labeled nucleotides. That is, the larger the reaction number, n, the longer the polymerization reaction occurs. The reaction is timed so that a series of overlapping Gaussian product distributions are obtained as illustrated in FIG. 1. Then, a portion of the reaction products from reaction n is transferred to reaction n+1, and the transferred polynucleotides bind to the 5' end of the template, and the 5' ends of all longer polynucleotides are displaced by the shorter transferred polynucleotides. All displaced DNA is selectively digested and a set of polynucleotides random at the 3' and 5' ends is generated. The polynucleotides of this set are separated on the basis of size and nucleotide composition. Each polynucleotide of the set contains all of the preceding "unknowns" if it contains any unknown and all succeeding "knowns" if it contains any "known", and the axis is at the position of the 3' terminal of the longest polynucleotide from reaction n.

To sequence by this second strategy of generating overlapping polynucleotides, the final polynucleotide distribution must be such that, if all of the polynucleotides were superimposed on the parent, the solved-for region of the "unknown" part of the parent is less than or equal to the total "unknown" region of superimposition, and the "known" region must be superimposed to at least this extent. "Known" in this case refers to the sequence that has been elucidated in previous reactions. Because of the nature of the last method described above for polynucleotide generation, the shorter the smallest polynucleotides in reaction n and n+1, the greater the total area of superimposition on the parent. Another way to look at this effect is to consider that shorter polynucleotides cause the total number of different polynucleotides generated to increase. To decrease the total number of different final polynucleotides, the shorter polynucleotides can be selectively eliminated. Procedures that follow represent schemes for polynucleotide generation, all of which contain a method to eliminate small polynucleotides. The extent of this elimination can be controlled to create the desired distribution of lengths in the final polynucleotides, which are those from which data is obtained.

A scheme which uses this approach to polynucleotide generation to produce polynucleotides of Strategy II is to generate primers and displacers by running the polymerization reactions for varying lengths of time, transferring the products from reaction n to n+1, and then digesting all displaced DNA and the part of the template which does not have any annealing DNA. The primers and displacers are produced by making copies of the template (which is not digested) with and without mass labeled nucleotides; the half of the products which are mass labeled are used as primers, to be extended with mass labeled nucleotides on the template, and the half which are not labeled are used as displacers, to randomly displace the 5' end of the extended products. The displaced DNA is digested. The resulting distribution is random at the 3' and 5' ends, contains all previous unknowns for any given unknown, and the axis is at the position of the 3' terminal of the longest polynucleotide from reaction n.

A third method is to run successive RNA polymerization reactions for a longer time. The products are then transferred from reaction n to reaction n+1 and the longer polynucleotide products are allowed to displace the shorter polynucleotides; the longer polynucleotides and the annealed polynucleotides are precipitated selectively; the shorter polynucleotides are removed; the precipitated polynucleotides are resuspended and the transcripts are extended with mass labeled deoxynucleotides and terminated with mass labeled dideoxynucleotides, and the primers are selectively degraded. The product would be random at the 3' and 5' ends and would contain all preceding "unknowns" for any given "unknown", and the axis would be at the position of the 3' terminal of a primer.

This precipitate procedure could be implemented in the absence of transfer from reaction n to n+1, in which case the distribution is determined by the time dependence of the polymerization reaction. These methods are discussed in more detail below. The latter two methods are shown schematically in FIG. 3.

B. Additional Methods of Generating Polynucleotides

In the previously described methods for generating polynucleotides which are random at the 3' and 5' ends, RNA was extended with DNA and the RNA was hydrolyzed to make the strands 5' random. (This applies to the primer extender method and the precipitate methods, but not to the restriction enzyme method). An alternative method for forming these hybrid polynucleotides is to reverse the order. DNA can be extended with RNA and the RNA can be terminated by polymerizing the RNA with RNA polymerase in the presence of a DNA polymerase and the four dideoxyribonucleoside triphosphates, that are mass labeled, or just DNA polymerase with Mn²⁺ substituted for Mg²⁺. The DNA can then be selectively degraded by an exonuclease that degrades DNA but not RNA. The purpose of this alteration is that it may be possible to polymerize polynucleotides of DNA that are much greater in length than RNA polynucleotides. Furthermore, in any method using RNA synthesis from a primer that requires production of very long strands, a viral, bacteriophage, or bacterial RNA promoter can be ligated to the 5' end of the DNA to be sequenced; this will have high activity and, since the promoter is not found in human DNA, transcription into RNA will only occur from this point. Examples of bacteriophage promoters which demonstrate this phenomenon are SP6 and T7.

BEST MODE OF CARRYING OUT THE INVENTION

This mode is shown schematically in FIG. 2.

I. Isolation of DNA to be Sequenced

DNA can be isolated and purified for the sequencing procedure from cells by various methods.

From eukaryotic cells, cell nuclei can be separated from other cellular organelles by differential centrifugation. An aqueous homogenate of the nuclei can then be formed and proteins can be removed from this homogenate by extraction with phenol or chloroform or a mixture of these two solvents.

If desired, individual chromosomes of eukaryotic cells can be isolated by methods such as flow cytometry. The chromosomes can then be digested into smaller fragments for sequencing by partial enzymatic digestion with a restriction endonuclease or cleaved by other techniques such as ultrasound.

To provide ample DNA for the sequencing techniques, DNA may be cloned. Gene libraries of genomic DNA can be created by taking the total DNA of a cell or chromosome, digesting it into fragments, inserting the DNA fragments into a cloning vector and introducing the vector into yeast or bacterial hosts. The vector will replicate along with its host and provide a convenient source of DNA for sequencing. Large DNA fragments for cloning can be generated by random shearing of DNA or by partial digestion with a suitable restriction enzyme. A variety of cloning vectors can be used to create the gene library. Examples are the yeast artificial chromosomes and bacteriophage vectors. Restriction enzymes may be used to generate DNA fragments with cohesive ends which bind complementary cohesive ends of a cleaved cloning vehicle. The DNA fragment can then be inserted at the cleavage site of the cloning vehicle and a yeast or bacterial host transformed with the recombinant vector.

II. Generation of Polynucleotides

The preferred method of generating the polynucleotides which satisfy the criteria set out above under Strategy I, II or III entails the synthesis of a set of mass labeled RNA/DNA hybrid molecules from randomly generated restriction enzyme fragments of the DNA to be sequenced (hereinafter "parent DNA").

A. Preparation of Fragments of Parent DNA

The sample of isolated and purified parent DNA is divided into two portions. One portion is set aside for use in later steps of the procedure. The other portion is used to generate restriction fragments of an average predetermined length. The restriction fragments are generated by digesting the parent DNA with a restriction endonuclease or a combination of restriction endonucleases. The length of the fragments can be estimated by calculating the frequency with which a given recognition site or sites would be expected to occur randomly in the parent. This is based upon the assumption that the recognition sites for restriction endonucleases are distributed randomly along the DNA of an organism. For instance, the tetranucleotide target for the restriction endonuclease Mbo I will occur about once in every 256 (4⁴) nucleotides. Therefore a complete digest of a DNA strand with Mbo I should provide fragments averaging about 256 nucleotides in length. The hexanucleotide site recognized by Bam HI will appear on average once in every 4096 (4⁶) nucleotides.
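The fragment-length estimate above can be sketched in a few lines. This is a minimal illustration only, assuming the four bases are equally frequent and independent along the parent DNA (the random-distribution assumption stated above); the function name expected_fragment_length is hypothetical and not part of the described procedure.

```python
def expected_fragment_length(site_length: int, alphabet_size: int = 4) -> int:
    """Average spacing, in nucleotides, between occurrences of one fixed
    recognition site, assuming equiprobable and independent bases."""
    if site_length < 1:
        raise ValueError("recognition site must contain at least one base")
    # A specific site of n bases occurs by chance once per alphabet_size**n positions.
    return alphabet_size ** site_length


# Tetranucleotide site (e.g. Mbo I) and hexanucleotide site (e.g. Bam HI).
print(expected_fragment_length(4))  # 256
print(expected_fragment_length(6))  # 4096
```

Real genomes depart from this assumption, so the value is best read as a first estimate for choosing an enzyme that yields fragments in the desired size range.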

The restriction enzyme digest is carried out to completion to produce fragments of the desired length. The digestion reaction may be repeated, if necessary, to achieve complete digestion.

The preferred average length of the fragments should be in the range of about 20 to about 400 nucleotides. The upper limit of about 400 represents the limit of reproducible separation of fragments of this size range by electrophoresis.

The fragments resulting from this restriction enzyme digest are then separated according to length. The preferred method of separating the fragments is the standard procedure of agarose or polyacrylamide gel electrophoresis. Gels containing the appropriate amounts of agarose or polyacrylamide can be formulated for optimum separation of fragments in the 20 to 400 nucleotide range. Differential migration of fragments of different length allows separation and individual elution of the separated fragments.

Gels can be formulated which separate double-stranded fragments differing in length by only a single nucleotide.

B. Separation of Complementary DNA Strands

After separation of individual fragments, the complementary DNA strands of each fragment are dissociated and separated. A suitable method for separating the strands entails denaturing the fragment with alkali and then electrophoresing the denatured strands on an agarose or polyacrylamide gel at neutral pH. Because the two dissociated strands generally have slightly different shapes, and consequently different mobilities in the gel, they migrate differently and can be eluted separately.

Other suitable methods of separating the DNA strands include cesium-chloride gradient ultracentrifugation or the poly (U,G) method and cloning in M13 bacteriophage.

Separated single strands can then be concentrated by precipitation with ethanol or isopropanol, since the DNA polyanion is not soluble in alcohol.

C. Synthesis of Mass-labeled RNA Copies of Strands

For each isolated single strand of the fragments, an RNA copy is produced with ribonucleotides labeled with isotopes at the 5th position of the pentose component of the ribonucleotides. (See below for discussion of the theory and procedure for using mass labeled nucleotides.)

There are at least two procedures for producing RNA copies of the fragments. The first is primed RNA synthesis. Mass-labeled ribonucleotide primers are synthesized having a sequence complementary to the recognition site of the endonuclease used to generate the DNA fragments from the parent DNA. Polymerase reactions are set up in an appropriate buffer, each containing the DNA polynucleotide template, the mass-labeled primer, the four mass labeled ribonucleotides and RNA polymerase. The primer is paired with the tetra- or hexanucleotide recognition site of the DNA fragment and provides a site for RNA polymerase initiation of RNA synthesis. The polymerase will synthesize an RNA copy of the DNA fragment.

In the second procedure, which is the preferred method, RNA copies are made without use of a primer. If ribonucleotides are added to the reaction mixture in high concentration in the absence of sigma factor, which controls initiation of RNA polymerization only from special DNA sequences called promoters, then RNA polymerase initiates RNA synthesis at random points along the DNA template and a free end stimulates initiation at that site. The procedure is carried out in substantially the same way as for primed RNA synthesis except that no primers are added and a concentration of about 2 millimolar ribonucleotides is used.

Upon completion of the reactions by either procedure, a DNAase is added to destroy the template DNA fragment. The resulting RNA transcripts are separated by molecular size.

A preferred method for this separation step is agarose or polyacrylamide gel electrophoresis. The aim is to isolate the longest RNA copy by collecting fractions of eluant from the gels frequently enough to permit collection of discrete RNA bands. Although isolation of the single largest RNA transcript is preferable, the procedure tolerates as many as four different polynucleotides. (See Matrix Method of Analysis I.)

A method of using mass spectroscopic detection of bands containing a single polynucleotide is described in the section entitled Automating the Method.

D. Extension of RNA Transcripts on Parent DNA Template- 3' Randomization

The RNA transcript (or transcripts, up to as many as four) is extended with mass-labeled deoxynucleotides, with the parent DNA serving as template for the extension, to generate RNA/DNA hybrid polynucleotides which are terminated randomly on the 3' end (the DNA end). This can be accomplished by employing a modification of the procedure of Sanger et al.

Polymerase reactions are set up in appropriate buffers, each containing the RNA transcript (or transcripts), a single strand of the parent (the original template), all four mass-labeled deoxynucleotides, all four mass-labeled dideoxynucleotides, and DNA polymerase I. This polymerase will initiate DNA synthesis at the 3' end of the RNA transcript (which serves as a primer). Mn²⁺ may be substituted for Mg²⁺ in the reaction mixtures to improve the efficiency of DNA synthesis at the RNA primer. This procedure generates a population of RNA/DNA polynucleotides ranging in size from the original RNA transcript with one dideoxynucleotide on the 3' end to molecules roughly 400 nucleotides in length.

Both complementary strands of the parent DNA are used as templates. In each case, the longest RNA transcripts copied from appropriate strands of the smaller restriction fragment made from the parent are used.

If non-complementary RNA interferes with subsequent steps, it can be removed by hydrolysis with a 3' to 5' exonuclease that requires a 3' OH, for example, snake venom exonuclease.

Both methods may give rise to either subsets of a single molecule or subsets of more than one molecule; therefore, the data must be treated as if subsets are made from the loss of nucleotides randomly from the 3' and 5' ends of the parent polynucleotide, and the Matrix Method of Analysis I must be used as described in this discussion.

A single RNA copy of the restriction fragments is desired. The RNA copy's length must be greater than half the length of the restriction fragment, or it must be at least 20 base pairs in length and a copy of a portion of the 5' half of the restriction fragment. The latter two conditions eliminate the possibility that the copy will bind elsewhere in the template by random probability and ensure that enough of the 5' end will be sequenced to complement the sequence determined for the anti-parallel strand, so that the entire polynucleotide and adjoining overlap region with the next restriction polynucleotide is sequenced. Furthermore, if one pure polynucleotide is isolated from either RNA copy procedure, with subsequent steps carried out and mass spectrum data obtained, then the sequence is solved for by taking sequential subsets from the first mass scan, where the change from a subset to one with the same composition but one nucleotide less in each case assigns the lost nucleotides in order from the 3' to the 5' direction. Using sequential subsets from the mass data from the second scan following the 5' randomization procedure, a gain of a nucleotide from the smallest to the largest polynucleotide assigns those nucleotides in order in the 3' to 5' direction.
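The sequential-subsets reading of the first mass scan can be illustrated with a short sketch. It assumes idealized data: one pure polynucleotide per fraction, fractions ordered from longest to shortest, and adjacent fractions differing by exactly one nucleotide. The function name lost_bases and the representation of a base composition as a Counter are illustrative assumptions, and the sketch ignores the separate terminal dideoxynucleotide label.

```python
from collections import Counter


def lost_bases(compositions):
    """Given base compositions ordered from longest to shortest polynucleotide,
    return the base lost at each step (i.e. read in the 3' to 5' direction)."""
    lost = []
    for longer, shorter in zip(compositions, compositions[1:]):
        removed = Counter(longer) - Counter(shorter)  # bases present only in the longer one
        gained = Counter(shorter) - Counter(longer)   # must be empty for a true subset
        if sum(removed.values()) != 1 or gained:
            raise ValueError("adjacent subsets must differ by exactly one nucleotide")
        lost.append(next(iter(removed)))
    return lost


# Hypothetical ladder derived from the sequence 5'-GATTC-3' by successive 3' loss.
ladder = [Counter("GATTC"[:k]) for k in range(5, 0, -1)]
print(lost_bases(ladder))  # ['C', 'T', 'T', 'A']
```

Only the compositions enter the calculation, which is why the approach needs a pure polynucleotide: with a mixture, the composition differences no longer identify a single lost base.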

If more than one RNA copy of the restriction polynucleotide is isolated, then the Matrix Method of Analysis must be used, because using the previous procedure on more than one polynucleotide and overlapping the sequences may result in more than one solution due to interchange of assignment between polynucleotides and/or shifting of the overlap region of the polynucleotides. The solution is discussed under Matrix Method of Analysis I.

E. Separation of RNA/DNA Polynucleotides

RNA/DNA hybrid molecules produced by the procedure outlined above will be about 200-400 in number. All molecules are composed of mass-labeled ribonucleotides, deoxynucleotides and a 3' terminal dideoxynucleotide.

The hybrid polynucleotides are then separated by length, preferably by polyacrylamide or agarose gel electrophoresis. Each successive molecule differing in length by one nucleotide can be resolved.

The RNA/DNA polynucleotides in bands can be collected as separate fractions from the gel. Bands eluted from the gel can be monitored by mass spectrometry as described in the Automating the Method section, by absorption at wavelengths of 260 and 280 nanometers or by ethidium bromide fluorescence.

III. Determining Base Composition and 3' Terminal Base of Polynucleotides

The separated RNA/DNA hybrid molecules can be reacted to release labeled formaldehyde, which is mass spectroscopically analyzed to determine the base composition of each polynucleotide and the identity of the 3' terminal dideoxynucleotide.

IV. Detailed Methodology of the Best Mode (Methods I)

The steps of the preferred method have been outlined above. The method is illustrated in the format of a flow chart in FIG. 2. The detailed methodology of each step is described below.

Step I: Isolation of DNA to be sequenced.

The DNA to be sequenced is isolated from genomic DNA and is separated into two aliquots. One aliquot is set aside for Step III. A restriction digest is performed on the other aliquot so that restriction fragments of an approximate length of 200 base pairs are obtained. The restriction fragments are then made single stranded. The methodology follows.

The individual chromosomes of the human genome can be isolated by flow cytometry to provide pure DNA to make a library or a set of restriction fragments which can be sequenced. See e.g., Gray et al., (1987) Science 238, 323-329; Davies et al. (1981) Nature 298, 374-376; Batholdi et al., Cytometry 3, 395-401; and Leba, (1982) Cytometry --, 145-154. The chromosomes can be digested with restriction enzymes into fragments of a length which can be estimated from the frequency with which a given restriction site would occur by chance in a DNA molecule of a given length. Once cleaved, the DNA fragments are separated by size or composition. The preferred method is to separate the fragments by gel electrophoresis (see e.g. Maniatis et al., Molecular Cloning, a Laboratory Manual, Cold Spring Harbor, 1982 (hereinafter "Maniatis"), pp. 173-185, 478). Also, the technique of pulse-field gradient electrophoresis could be used as described in D. Schwartz and C. Cantor, Cell 37, 67 (1984), and denaturing gradient gels could be implemented. See Fisher and Lerman, PNAS 80, 1579 (1983) and Lerman et al. Annu. Rev. Biophys. Bioeng. 1984, 13:399-423. Methods of visualizing and removing nucleic acids from gels are described in Maniatis, pp. 163-172; Vogelstein et al. PNAS 76(2):615-619. For removing DNA from polyacrylamide gels, see Maniatis, p. 178.

Other methods include CsCl gradient ultracentrifugation, see e.g., Maniatis, pp. 93-94, and various chromatographic techniques such as reverse phase or ion exchange HPLC. For separation of large DNA or RNA polynucleotides by HPLC, see Larson et al., J.B.C. 254:5535-5541 (1979); Patient et al. J.B.C. 254:5542-5547 (1979); Patient et al. J.B.C. 254:5548-5554 (1979). For separation of short polynucleotides by HPLC, see Gait et al., Nucleic Acid Res. 10(20), 6243-6254 (1982); Dizdaroglu et al., J.Chrom. 171:321-330 (1979); Pearson et al., J.Chrom. 200:137-149 (1983); Haupt et al., J.Chrom. 260:419-427 (1983); McLaughlin et al., Anal.Biochem. 124:37-44 (1982); McFarland et al., Nucleic Acid Res. 7(4), 1067-1080 (1978). Pure polynucleotides from a restriction digest can also be isolated by cloning methodologies which allow for isolation and amplification of DNA molecules.

Fragments of exogenous DNA that range in size up to several hundred kilobase pairs can be cloned into yeast by ligating them to vector sequences that allow their efficient propagation as linear artificial chromosomes; see Burke et al., Science 236:806-812 (1987).

Several additional types of vectors can be used to clone fragments of foreign DNA and propagate them in appropriate host cells (e.g., E. coli). These include plasmids, bacteriophage lambda, cosmids, and bacteriophage M13. These vectors are discussed in Maniatis, pp. 2-54. Propagation and maintenance of bacterial strains and viruses is discussed in Maniatis, pp. 55-74. Large scale isolation of plasmid DNA is discussed in Maniatis, pp. 75-98. Introduction of plasmid and bacteriophage lambda DNA into E. coli is discussed in Maniatis, pp. 247-268. Construction of genomic libraries is discussed in Maniatis, pp. 269-308. Plasmid or bacteriophage lambda DNA can be rapidly isolated from plasmids grown as individual colonies or plaques by methods described in Maniatis, pp. 365-373. Cloning methods are also discussed in Methods of Enzymology, vol. 101, 1983, pp. 3-123.

The procedure of using restriction digests and separation of the product fragments can be repeated as many times as necessary, using any combination of the above referenced strategies, until DNA restriction fragments of length between 20 and 400 base pairs are obtained.

A complete method for preparation of double-stranded restriction fragments is given in Methods of Enzymology; vol. 65, 564-565 (see preparation of restriction fragments).

Isolated double stranded restriction fragments are made single stranded by electrophoresis on strand separating gels as described in Maniatis, pp. 179-185.

The final isolated purified single stranded fragments can be concentrated by the method of precipitation with ethanol or isopropanol described in Maniatis, pp. 461-463.

Step II: Preparation of RNA transcripts

An RNA transcript of the single stranded restriction fragment is made with mass labeled ribonucleotides. (See Determination of the Base Composition and Terminal Base Identity Section.) There are at least two procedures for making RNA transcripts.

Procedure A. RNA copies of the restriction fragments are made by the primed RNA synthesis procedure described in Methods of Enzymology; vol. 65, 1980, pp. 497-499. Mass labeled primers are used. The primers are complementary to the restriction enzyme site which was cleaved to generate the restriction fragment. Mass labeled ribonucleotide triphosphates are used in place of ³²P-labeled NTPs. RNA copies are separated and bands are collected by a fraction collector during electrophoresis or chromatography. The instrumentation for collecting eluted fractions from an electrophoresis procedure is described below in the sections entitled Electrophoresis Instrumentation and Automating the Method.

Procedure B. RNA transcripts of the single stranded restriction fragments are made using E. coli RNA polymerase, where the 3' free end of the restriction fragment serves as a promoter for core polymerase (Nature, Vol. 223 (1969), pp. 854-55). The polymerization protocol is described in Methods of Enzymology; Vol. 65 (1980), pp. 497-99, with the exceptions that all nucleotide triphosphates are mass labeled, that no primer is used, and that 2 mM nucleotide triphosphates are used to stimulate RNA polymerase to initiate at the 3' end without a primer. Yeast B enzyme may be substituted for E. coli polymerase. In the presence of denatured DNA, the average length of RNA chains synthesized by the yeast B enzyme is larger than that of chains polymerized by the E. coli enzyme. With the former enzyme, the average chain-length growth rate at 30° C. is about seven nucleotides per second (The Enzymes, Boyer, Vol. X, p. 309).

For each procedure the largest RNA copy is isolated. It is preferable to collect fractions frequently so as to isolate a band containing a single polynucleotide, generally the largest polynucleotide. However, the sequencing procedure will tolerate four different polynucleotides in the band isolated. If five different polynucleotides are isolated, then a different band from the separation procedure may be used or the sequence determined for the antiparallel strand can suffice.

If mass spectrometry is used as a method to monitor the bands, then a band which contains a single polynucleotide can be selected when the peaks corresponding to the possible bases are of integer relative intensity. This method is described in the Automating the Method Section.
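The integer-relative-intensity criterion can be sketched as a simple check. This is an illustration under stated assumptions rather than the procedure of the Automating the Method Section: it assumes a known signal per single nucleotide (the hypothetical parameter unit_intensity, for example from an internal standard), linear detector response, and an arbitrary tolerance of 0.1; the function name is likewise hypothetical.

```python
def is_single_polynucleotide(base_peaks, unit_intensity, tolerance=0.1):
    """Return True when every base peak is close to an integer multiple of the
    signal expected from one nucleotide per molecule."""
    if unit_intensity <= 0:
        raise ValueError("unit_intensity must be positive")
    for intensity in base_peaks.values():
        ratio = intensity / unit_intensity
        # A pure polynucleotide contains a whole number of each base.
        if abs(ratio - round(ratio)) > tolerance:
            return False
    return True


# Hypothetical peak intensities (arbitrary units) for two bands.
pure_band = {"A": 30.2, "C": 19.9, "G": 40.1, "U": 10.0}
mixed_band = {"A": 25.0, "C": 19.9, "G": 40.1, "U": 10.0}
print(is_single_polynucleotide(pure_band, unit_intensity=10.0))
print(is_single_polynucleotide(mixed_band, unit_intensity=10.0))
```

A band holding two polynucleotides in unequal amounts generally gives fractional ratios, which is what the check detects; an equimolar mixture could still pass, so the test is necessary rather than sufficient.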

Step III: Extension of RNA transcripts in DNA

The RNA copies are used as primers and are extended on the DNA template with mass labeled deoxyribonucleotides. The extension is terminated randomly with mass-labeled dideoxy derivatives so that the primer is extended to a maximum length of approximately 400 nucleotides, and the product polynucleotides are separated by methods described in Step IV. See also Electrophoresis Instrumentation.

Mass labeled deoxy- and dideoxyribonucleotides are used and the A, T, C, and G reactions are performed as a single reaction with final reactant concentrations as described by Smith, DNA Sequence Analysis by Primed Synthesis Primers, Methods of Enzymology; vol. 65, pp. 561-580 (1980). DNA polymerase I will initiate DNA synthesis at an RNA primer as described by Lee Huang et al., PNAS 511:1022-1028 (1964). Mn²⁺ may be substituted for Mg²⁺ if DNA extension at the RNA primer occurs with low efficiency. The effect of Mn²⁺ on the substrate specificity of DNA polymerase I is described in DNA Replication, Kornberg, p. 151 and in Methods of Enzymology; vol. 65, p. 572 (1980).

A portion of the parent and a portion of the complementary antiparallel strand are sequenced in two separate reactions. In one extension reaction, the longest RNA copy made from the single stranded restriction fragment serves as a primer where the double stranded parent serves as the template. The longest RNA copy made from the complementary restriction fragment serves as primer for a separate reaction in which double-stranded parent is used as the template. A method by which the double stranded DNA is melted and the primer is allowed to bind to initiate extension is described in Methods of Enzymology; vol. 65, pp. 561-580 (1980). The RNA transcript binds only to the complementary strand of the dsDNA; thus it serves as a primer only for this strand. Another method to make hybrid polynucleotides by using the RNA copies as primers is to run two separate extension reactions where both the longest RNA copy of a given restriction fragment and the longest RNA copy of the complementary restriction fragment are added to a reaction mixture containing single stranded parent in one reaction and the complementary single strand in another reaction. Only the appropriately complementary RNA copy serves as a primer for the extension reaction because it will bind to the single stranded template whereas the other RNA copy will not. However, the complementary RNA copies may bind to each other; a method by which the double stranded RNA is melted and the appropriate primer is allowed to bind so that extension is initiated only from this primer is described in Methods of Enzymology; vol. 65, pp. 561-580 (1980).

If the noncomplementary RNA copy interferes with subsequent steps, then it can be selectively removed following the above procedure by hydrolysis with a 3' to 5' exonuclease that requires a 3' OH. Venom exonuclease digestion can be performed as described by Tu and Wu, Methods of Enzymology; vol. 65, pp. 620-638.

Step IV: Separation of polynucleotides

The extended primers are separated. The approximate range of the number of hybrids made from each RNA primer is between 200 and 400 and the approximate length of any given hybrid is less than 400 nucleotides. The preferred method of separation is gel electrophoresis as described in Methods of Enzymology; vol. 65, pp. 573-575 (1980). Other methods of separation are described in Maniatis, pp. 173-185 and in Methods of Enzymology; vol. 65, pp. 542-545 (1980).

The fractions can be electroeluted, monitored by mass spectrometry, and collected as described in the Automating the Method Section. Separated fractions can also be collected from the gel on glass bead wells or anode wells of a carousel collector as described in the Electrophoresis Section. The glass bead collection procedure is a modification of Method II described by Vogelstein et al., PNAS 76, 615 (1979). Glass beads are present in the lower anode well and the electrophoresis buffer in this well contains saturated NaI as described in the above reference. Following collection, the polynucleotides adherent to the glass beads can be caused to undergo reaction to release formaldehyde which is recorded with a mass spectrometer, and/or the polynucleotides are eluted from the glass beads by suspending them in a solution in the absence of NaI.

DNA or RNA bands collected as fractions can be monitored spectrophotometrically at wavelengths of 260 and 280 nm or by ethidium bromide fluorescent quantitation. See Maniatis, pp. 468-469; 163. The preferred method of ethidium bromide fluorescent quantitation is to use ethidium bromide in the gel or in the anode well at a concentration of about 0.5 ug/ml. See Maniatis, p. 163. Also, it is not necessary to remove ethidium bromide from RNA or DNA to perform the primed synthesis of Step III. (See Methods of Enzymology; vol. 65, p. 565 (1980).) If the mass labeled nucleotides contain in addition a radiolabel, then the bands can be monitored by a scintillation counter.

Step V: Mass Data Acquisition

A portion of a selected purified polynucleotide (polynucleotides) sample(s) is (are) reserved for subsequent steps. The selection criteria are described in the Matrix Method of Analysis I Section.

Each purified polynucleotide sample is reacted to form 3' nucleotides or nucleosides which are further reacted to release labeled formaldehyde molecules from the 5' pentose position of each nucleotide or nucleoside. The released formaldehyde is analyzed with a mass spectrometer.

Step VI: 5' Randomization Procedure

The 5' end of the selected purified polynucleotide (polynucleotides) is (are) made random in length by removing ribonucleotides from the 5' end, so that the products form a series of molecules, each one nucleotide shorter than the last, ranging from the largest polynucleotide down to the molecule containing only DNA (or, for the case where the longest polynucleotide consists of RNA with a terminating dideoxynucleotide, down to the dinucleotide containing one ribonucleotide and one dideoxyribonucleotide).

Method A: The 5' end of the hybrids contains RNA; therefore, they can be randomized by a 5' to 3' exoribonuclease. This can be accomplished, for example, by using yeast ribonuclease as described in Enzymes, Dixon and Webb, p. 852 (1979) and in Ohtaka, J.B.C. 54:324-327 (1963).

Method B: Reverse transcriptase possesses a polypeptide which selectively degrades the RNA of an RNA-DNA hybrid with a partially processive mechanism. See Maniatis, pp. 128-132 and Gerald, Biochem. 20:256-264 (1981).

A partially processive 5' to 3' exonuclease attacks a strand and removes a small number of nucleotides by a processive mechanism and then dissociates to attack a different strand. Thus, the extent of hydrolysis at the 5' end of any polynucleotide is a random event. Therefore, the desired distribution of 5' randomized polynucleotides is obtained with this type of enzyme.

An RNAase HI or αB RNAase H reaction is run as described under RNAase H assays, Gerald, supra, with the exception that either appropriate single stranded template is added to the reaction mixture containing the polynucleotide to be 5' randomized, or double stranded parent is added. In the latter case, an annealing procedure is performed to allow separation of the double stranded template and to allow pairing of the polynucleotide with the complementary strand of the added DNA template. The annealing procedure is set forth in Methods of Enzymology; Vol. 65, p. 567 (1980). The reaction is terminated after a time sufficient to generate a distribution of reaction products as described above. A graph of the extent of hydrolysis as a function of time appears in FIGS. 2 and 3 of Gerald, p. 260 (1981). These plots are indicative of the time length of the 5' randomization reaction. The concentrations of the template, polynucleotide, and enzyme, and the times of reaction are adjusted to yield the desired distribution of products. The effect of manipulation of these variables on the products appears in Gerald, supra, and in DNA Synthesis in Vitro, Wells and Inman, pp. 270-330 and in particular p. 300. Approximate values for these variables are as follows: 2.4 pmole of polynucleotide, 2.9 pmole of template, 10⁻³ unit of enzyme, in 1 ul volume run for about 5-20 mins. One advantage of using RNAase H and added template is that the problem of hydrolysis resistant hairpin loops is overcome.

Method C: The hybrid polynucleotides are 5' randomized using an endonuclease that randomly cleaves RNA and produces a 3' OH. The molecule cleaved from the 5' end is hydrolyzed with a processive 3' to 5' exoribonuclease that requires a 3' OH. Ribonuclease can serve as the endoribonuclease. See Enzymes, Dixon and Webb, p. 856 (1979); Hiramuru J.B.C. 65:701 (1969). Venom exonuclease can serve as the 3' to 5' processive exonuclease that requires a 3' OH. The protocol is described in Tu and Wu, Methods of Enzymology, vol. 65, pp. 620-638 (1980).

Method D: 5' randomization can be accomplished by mild RNA hydrolysis designed to overcome the problem of hairpin loops. See Methods of Enzymology, vol. 65, pp. 667-669 (1980). This procedure produces single hits of the RNA and a mixture of 2' and 3' phosphates. The 2' or 3' phosphate can be removed to create a 2' and 3' OH. Then a 3' to 5' processive exonuclease that requires a 3' OH can be used to degrade the RNA polynucleotide which was hydrolyzed from the 5' end of the polynucleotide. The 2' or 3' phosphate can be hydrolyzed to an OH by alkaline phosphatase. See Enzymes, Dixon and Webb, p. 19; 840. The phosphatase procedure is described in Maniatis, pp. 133-134. Venom exonuclease can serve as the 3' to 5' processive exonuclease which degrades the RNA polynucleotide which was hydrolyzed from the 5' end of the polynucleotide. The reaction is described in Tu and Wu, Methods of Enzymology, vol. 65, pp. 620-638 (1980).

The preferred methods are methods B and D.

Step VII: Separation of 5' Randomized polynucleotides

The polynucleotides are separated by the procedures described in Step IV.

Step VIII: Second Mass Data Acquisition

Collected fractions are reacted to release labeled formaldehyde which is analyzed with a mass spectrometer as described in Step V.

Step IX: Data analysis

Approximately 400 nucleotides are sequenced from this data using the procedure discussed under Matrix Method of Analysis I. The data from all the restriction fragments allow the sequence of the entire parent to be determined.

ALTERNATIVE MODES OF CARRYING OUT THE INVENTION

These modes are based upon the second approach for preparing overlapping sequences. The form of the matrix method for solving the sequence with data from these polynucleotide generating strategies is discussed under Matrix Method of Analysis II.

The following procedure is done as a first step in all subsequent procedures described.

The first 40 or so nucleotides of parent DNA are sequenced. This can be done by replicating the template with mass labeled nucleotides and terminating with mass labeled dideoxy derivatives. The dideoxy derivative of each polynucleotide serves as the internal standard. The replication polynucleotides are separated and the base composition determined by mass spectroscopic analysis. The sequence is solved using the sequential subsets method. This is discussed under Method of Sequential Subsets.

A. Procedure I: Direct transfer of mass labeled replication products.

The template is replicated with random termination using deoxyribonucleotides and dideoxynucleotides that are each uniquely mass labeled at the 5th position of the pentose. The polynucleotides uniquely mass labeled at the 5th position of the pentose are reacted to release formaldehyde from this position which identifies each of the 4 different bases and the terminal base of each polynucleotide. The conditions of the polymerization reaction are controlled such that the products of reaction n+1 are on average longer than those of reaction n. For example, this can be accomplished by controlling the time at which dideoxynucleotides are added to the polymerization reaction mixture, or the concentration of dideoxynucleotides in the reaction mixture can be made greater in reaction n than in reaction n+1. A portion of the products from reaction n is transferred to n+1 successively for each reaction, so that the 5' ends of the reaction products of reaction n+1 are displaced from the template by those from reaction n. The displaced DNA is degraded with S1 nuclease, an enzyme which specifically digests single stranded DNA. The polynucleotides are thus made random at the 5' as well as the 3' end. The short polynucleotides are isolated from the transferred fragments and from the remaining template by a method which separates by size and/or nucleotide. Samples are separated by ion exchange or gel filtration chromatography, HPLC, or polyacrylamide gel electrophoresis. The polynucleotides are reacted to liberate formaldehyde which is analyzed by mass spectrometry, and the sequence is determined by the Matrix Method of Analysis II.
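The effect of dideoxynucleotide concentration on average product length can be illustrated with a deliberately simple model. The sketch assumes that, at each template position, the polymerase incorporates a terminating dideoxynucleotide with a probability equal to the dideoxynucleotide fraction of the nucleotide pool, independently of position and base; real polymerases discriminate between the two substrates, so the numbers are illustrative only. Both function names are hypothetical.

```python
def termination_probability(ddntp_conc: float, dntp_conc: float) -> float:
    """Assumed per-position chance of chain termination: the dideoxynucleotide
    share of the total nucleotide pool (equal incorporation efficiency assumed)."""
    if ddntp_conc <= 0 or dntp_conc < 0:
        raise ValueError("concentrations must be positive")
    return ddntp_conc / (ddntp_conc + dntp_conc)


def mean_product_length(ddntp_conc: float, dntp_conc: float) -> float:
    """Mean number of nucleotides added (terminator included) under the
    geometric model: the mean of a geometric distribution is 1/p."""
    return 1.0 / termination_probability(ddntp_conc, dntp_conc)


# Reaction n uses a higher dideoxynucleotide fraction than reaction n+1.
print(mean_product_length(1.0, 99.0))   # reaction n: about 100 nucleotides
print(mean_product_length(1.0, 199.0))  # reaction n+1: about 200 nucleotides
```

Under this model, halving the dideoxynucleotide fraction roughly doubles the mean extension, which is the direction of the adjustment described for making the products of reaction n+1 longer than those of reaction n.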

B. Procedure II: Mass labeled DNA primers and unlabeled DNA or RNA displacers.

The DNA template blocked at the 3' end is partially replicated, starting at the 3' end of the template, using ribonucleotides and RNA polymerase such that the reaction products of reaction n+1 are on average longer than the reaction products of reaction n. The replication reaction is terminated by denaturing the RNA polymerase enzyme. Polynucleotides of random length at their 3' end are produced. The polynucleotides are transferred from reaction n to reaction n+1 after removing the template from this aliquot, and the transferred polynucleotides are allowed to anneal. The longer polynucleotides will anneal over or displace the shorter ones, so that the resulting annealed parts are double stranded over a length as great as the longest polynucleotides from both reactions and are random at the 3' end. S1 nuclease is used to degrade single stranded RNA and DNA so that only that part of the template that is annealed to RNA remains. The RNA is then selectively and completely destroyed. This can be done with RNAase H or base hydrolysis. Two separate replication reactions are run using the remaining part of the template generated in the previous reactions (the partial template). In one replication reaction an unlabeled RNA or DNA copy of the partial template is made. When a DNA copy is made, it is generated so that it possesses a 5' hydroxyl.

In another separate reaction a mass labeled copy of the partial template is made. The unlabeled DNA or RNA products from the former reaction are called displacers, and the mass labeled DNA products from the latter reaction are called primers. The primers are extended on the original template using mass labeled deoxynucleotides and mass labeled dideoxynucleotides, which terminate the replication reaction randomly. The displacer polynucleotides are added to the products of this reaction under conditions which favor the displacement from the template of the 5' end of the extended primers by the displacers. The reaction is incubated with S1 nuclease, which destroys single stranded DNA, but not double stranded DNA or double stranded DNA/RNA hybrids, and produces 5' nucleotides (i.e., leaves a 5' phosphate). An S1 nuclease digestion may have to be done prior to transferring the displacer. This will prevent the chance annealing of the displacers or displaced fragments to the template 3' to the replication polynucleotides. Thus, the displaced part that is single stranded is selectively degraded, and the remaining part that is not degraded possesses a 5' phosphate. The polynucleotides are allowed to denature following inactivation of S1 nuclease.

Next, if DNA displacers are used, the reaction products are incubated with spleen phosphodiesterase. The DNA displacers, which contain a 5' OH, are selectively degraded because spleen phosphodiesterase degrades RNA or DNA completely and requires a 5' hydroxyl terminus. The remaining part of the extended primer is not degraded because it contains a 5' phosphate group.

When RNA displacers are used, the RNA can be removed completely and selectively by base hydrolysis or by hydrolysis with the enzyme RNAase H.

The labeled polynucleotides are separated by ion exchange chromatography, HPLC, or gel electrophoresis and then reacted to release formaldehyde, which is analyzed by mass spectrometry. The base composition and terminal base identity are determined so that the sequence can be solved using the Matrix Method of Analysis II.

C. Procedure III: RNA primers; RNAase H or base hydrolysis.

Partial DNA templates are made following the procedure described under Procedure II. Complementary RNA copies of the partial DNA templates, which vary in length at the 3' end, are made by a replication reaction of the partial templates. The RNA replication products are isolated and used as primers in a DNA extension reaction. The primers are extended in mass labeled DNA. Mass labeled deoxyribonucleotides are used during replication, and competing mass labeled dideoxynucleotides are used to terminate the replication reaction. The RNA is completely and selectively digested, with RNAase H for example, or by base hydrolysis. The DNA polynucleotides are purified; they are random on the 5' end because the primers were 3' random, and they are random on the 3' end because replication was terminated randomly. Furthermore, by the nature of replication the condition X_(n) --X₁ is satisfied by all of these polynucleotides. Thus, the polynucleotides conform to the criteria of strategy II; therefore, data can be obtained by mass spectrometry of the formaldehyde released from the polynucleotides, and the sequence can be solved for using the Matrix Method of Analysis II.
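The way Procedure III yields polynucleotides that are random at both ends can be pictured with strings. The following is a minimal sketch, assuming the complementary copy is represented as a Python string read 5' to 3', that each RNA primer ends at an index in `primer_ends`, and that each DNA extension stops at an index in `termination_points`; these names and the index representation are hypothetical. The example sequence is the one given for FIG. 4.

```python
from collections import Counter


def fragment_family(copy, primer_ends, termination_points):
    """Return (DNA part, base composition, terminal base) for each product.

    The DNA part of a product begins where its RNA primer ended (random
    5' end once the RNA is destroyed) and ends at the dideoxy
    termination point (random 3' end).
    """
    fragments = []
    for start in sorted(primer_ends):
        for stop in sorted(termination_points):
            if stop > start:
                dna = copy[start:stop]
                fragments.append((dna, Counter(dna), dna[-1]))
    return fragments


copy = "GACTACGATGCCTAGTGCT"
family = fragment_family(copy, primer_ends=[3, 4], termination_points=[9, 17])
for dna, composition, terminal in family:
    print(dna, dict(composition), terminal)
```

Each tuple carries exactly the two kinds of data the mass spectrometric step is described as providing: the base composition and the terminal base identity.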

D. Procedure IV: Direct precipitation.

An outline of Procedures IV and V is shown schematically in FIG. 3.

The template is partially replicated starting at the 3' end of the template using ribonucleotides and RNA polymerase such that the replication products of reaction n+1 are on average longer than the reaction products of reaction n. The replication reaction is terminated by denaturing the RNA polymerase enzyme or by a precipitation reaction as follows.

The reaction mixture is titrated with a cation, for example calcium, which will precipitate the RNA replication products and the template such that the large polynucleotides are precipitated preferentially over the smaller polynucleotides. The cation concentration which will precipitate the desired length of polynucleotide, while allowing the polynucleotides shorter than this length to remain in solution, can be determined experimentally, and this predetermined concentration is the end point of the precipitation reaction. It can be monitored with an ion selective electrode, for example, or the precipitation reaction can be monitored via light scattering. The small RNA products remaining in solution are removed.

Next, the precipitate is resolubilized and any excess cation is removed by an exchange reaction with another anion, where the cation and anion have a much lower solubility product constant, K_(sp), than that of the cation and the nucleic acid. The cation can also be removed with a chelating agent. The RNA replication products are used as primers for mass labeled DNA replication. Mass labeled deoxynucleotides are used during replication, and competing mass labeled dideoxynucleotides are used to terminate the replication reaction. Following DNA replication, the RNA part of the hybrid molecules is selectively and completely destroyed. The RNA can be degraded selectively with RNAase H or by base hydrolysis. The resulting DNA polynucleotides, which are now random at the 3' and 5' ends, are purified by HPLC, ion exchange chromatography or polyacrylamide gel electrophoresis (see Electrophoresis Instrumentation). The polynucleotides are reacted to release formaldehyde, which is analyzed by mass spectrometry to determine the base composition and terminal base. The sequence is solved by using the Matrix Method of Analysis II.

E. Procedure V: Precipitation after transfer.

The procedure is the same as for Procedure IV, except that the replication products of any given reaction are transferred to the succeeding reaction as part of the method to eliminate small replication polynucleotides.

The template is partially replicated starting at the 3' end using ribonucleotides and RNA polymerase such that the replication products of reaction n+1 are on average longer than the reaction products of reaction n. The replication reaction is terminated by denaturing the RNA polymerase enzyme.

RNA polynucleotides from reaction n are transferred to reaction n+1 to displace short polynucleotides with the longer polynucleotides of the distribution produced in this reaction; the smaller polynucleotides are then in solution, unbound to the large template. A precipitation reaction is carried out to preferentially precipitate the hybridized complexes, leaving the small fragments in solution. The precipitate is resolubilized and the replication is resumed in DNA with mass labeled substrate. Mass labeled deoxynucleotides are used in the presence of mass labeled dideoxy terminators. The RNA is selectively and completely destroyed with RNAase H or base hydrolysis. Following purification of the polynucleotides by polyacrylamide gel electrophoresis, ion exchange chromatography, or HPLC, and following mass spectroscopic analysis of released formaldehyde, the polynucleotides are sequenced by the Matrix Method of Analysis II.

DETAILED METHODS OF ALTERNATIVE MODES (METHODS II)

The following protocols are examples of specific methods to implement the steps outlined in the above Alternative Modes of Carrying Out the Invention.

A. Isolation of Parent to be Sequenced

The parent molecule to be sequenced can be isolated by the methodologies discussed earlier under Methods I. Restriction fragments with length on the order of 10⁵ nucleotides are isolated in the present case. Furthermore, a useful method is that of Schwartz and Cantor, Cell 37:67-75 (1984), pulsed-field gradient electrophoresis. This technique, which is based on the application of alternating, perpendicular inhomogeneous electric fields to a standard horizontal agarose gel, is capable of resolving DNA fragments of up to 5000 Kb. It has already been applied to the separation of intact DNA strands from yeast chromosomes. This method can be used preparatively to purify very large restriction fragments (200 to 5000 Kb) produced from a digest with enzymes which cut infrequently in human DNA. These purified fragments could then be digested directly with a second, different restriction enzyme and these products cloned in an appropriate vector; alternatively, DNA fragments of an approximate length of several hundred Kb can be expanded in yeast artificial chromosomes as described by Burke, et al., Science 236:806-812 (1987) prior to a second digest and cloning step. The resulting library would have a low complexity and consist solely of sequences contiguous in the human genome.

B. Primary DNA Copies

DNA copies of the template can be made by first adding a 3' polynucleotide tail to the template with terminal nucleotide transferase. The method is described in Maniatis, p. 148; pp. 239-240. A DNA primer complementary to the polynucleotide tail is added and extended by the DNA polymerization protocol described in Primings, DNA Sequence Analysis by Primed Synthesis, Methods of Enzymology, vol. 65, pp. 561-580. The procedure is also described by Sanger, J. of Molec. Bio. (1975) 94: 441-448. In either case, the reaction is performed in the absence of dideoxy terminating nucleotides, and polymerization is terminated by other means, such as denaturing the enzyme by heat.

Mass labeled nucleotides are used when appropriate as described in Alternative Modes of Carrying Out the Invention. For the procedures which require a DNA replication product with a 5' OH, a primer containing a 5' OH that is also complementary to the polynucleotide tail is used.

C. Primary RNA Copies

RNA copies of the template which all initiate from the same point at the 3' end of the template can be made by ligating the template to an efficient bacteriophage promoter, for example SP6. SP6 is a Salmonella phage that encodes an RNA polymerase specific for SP6 promoters (Butler and Chamberlin, J. Biol. Chem. 257, 1982, 5772-5778). One advantage of this in vitro transcription system is that SP6 RNA polymerase initiates transcription exclusively at the SP6 promoter, thus avoiding transcripts that initiate elsewhere. In addition, it is possible to obtain long transcripts, and SP6 RNA polymerase is relatively easy to purify in large amounts and is remarkably stable.

SP6 can be ligated to the template in the correct orientation by starting with a double stranded DNA molecule of SP6, ligating a linker to SP6 and then ligating the linker to the double stranded restriction fragment of which one strand serves as the template in subsequent sequencing reactions.

Blunt end cDNA copies of SP6 can be generated by incubating SP6 cDNA with protruding ends with bacteriophage T₄ DNA polymerase or E. coli DNA polymerase I. These are enzymes that remove protruding 3' single stranded termini with their 3' to 5' exonuclease activities and fill in recessed 3' ends with their polymerizing activities. These reactions are described in Maniatis, p. 243. The combination of these activities generates blunt ended cDNA molecules, which are incubated with a large excess of linker molecules in the presence of bacteriophage T₄ DNA ligase, an enzyme that is able to catalyze the ligation of blunt ended DNA molecules. The products of the reactions are cDNA molecules carrying polymeric linker sequences at one of their ends. These molecules are then cleaved with the appropriate restriction enzyme and ligated with the restriction fragment that has been cleaved with a compatible enzyme, one that creates ends which are complementary to those created by the former enzyme. This method as well as others are discussed in Maniatis, pp. 219-220.

Most linkers are supplied by the manufacturer with 5' hydroxyl ends. Because T₄ DNA ligase requires 5' phosphate ends, it is necessary to phosphorylate the linkers before they can be joined to DNA. The kinase and ligase reactions can be carried out sequentially in the same reaction mixture. T₄ polynucleotide kinase can be used to phosphorylate the 5' termini of the synthetic linkers prior to ligation. The method is described in Maniatis, pp. 125-126; 394-395. The attachment of synthetic linkers is described in Maniatis, pp. 396-397.

SP6 with the appropriate sticky ends can be ligated to the restriction fragment by methods described in Maniatis, pp. 396-397, where the linkers are molecules of SP6 promoter attached to a linker with sticky ends for the restriction fragments. Sticky ends refer to uneven ends that are created by the endonuclease activity of the restriction enzyme. Complementary ends are produced between two molecules that are cleaved by the same enzyme. Also, SP6 can be blunt-end ligated to the restriction fragment by methods described in Maniatis, pp. 396-397, where SP6 is ligated directly to the restriction fragments. Alternatively, SP6 can be ligated to the linker to form a hybrid SP6-linker molecule which can be linked to the restriction fragment. The restriction fragment can also be cloned into a vector that contains the SP6 promoter. This method is described for the pBR322 plasmid in Green, Maniatis, Melton, Cell, 32:681-694 (1983).

The procedure for splicing a restriction polynucleotide into a vector and amplifying the product is described in Maniatis, pp. 51-54; 270-307, and Methods of Enzymology, vol. 101, 1983, pp. 3-123.

RNA transcription is carried out with the SP6 promoter system as described in Green et al., Cell (1983), with the modification that the ribonucleotides are not radiolabeled. When a vector system is used, the template is linearized by restriction enzyme digestion (see Green et al., supra, p. 682).

Another method for making an RNA copy of the template initiated only from the 5' end is to first add a polynucleotide tail to the 3' end of the template as described in Method IIB. The complementary RNA primer is added, and the RNA primer is extended with ribonucleotide triphosphates using DNA polymerase in the presence of Mn²⁺. The polymerization reaction with Mn²⁺ is described in Methods of Enzymology, vol. 65, p. 572. The procedure is followed as described in Methods of Enzymology, pp. 561-580, except that ribonucleotides are substituted for deoxynucleotides, the reaction is run without dideoxynucleotides, Mn²⁺ is substituted for Mg²⁺, and the reaction is terminated by denaturing the enzyme or by the addition of EDTA.

D. Removal of Template Before the Transfer of an Aliquot from Reaction n to Reaction n+1

The DNA template can be selectively destroyed, in those cases where RNA copies are transferred from reaction n to n+1, by the following procedure. Incubate the aliquot to be transferred with 20 µg of DNAase I for 20 minutes. The enzyme is then inactivated, for example by the addition of acid. The solution is neutralized with an equivalent amount of base.

The template can also be removed by gel electrophoresis as described in Method I or by gel or DEAE chromatography. DEAE chromatographic techniques are described in Cloning, A Laboratory Manual, Maniatis, p. 130, p. 166. Gel chromatography is described in Maniatis, pp. 464-465, where the reaction buffer is substituted for the equilibration buffer and the size range of the gel used is appropriate for the separation of the polymerization products from the template.

For Procedure I, a 3' to 5' exonuclease could be used to degrade the template, since the replicated fragments contain a dideoxynucleotide at the 3' end and are resistant to degradation. The method is described below under 3' to 5' Digest with the Requirement of a 3' OH.

E. Hybridization Methods

Procedures described in Alternative Modes of Carrying Out the Invention contain one or more steps which depend on competitive displacement of all or a part of one polymerization product by another. In some cases, displacement can be accomplished where it is thermodynamically unfavorable by increasing the concentration of the competing strand. Methods describing the competitive binding of short DNA polynucleotides to DNA templates, thus partially displacing large DNA polynucleotides by this mechanism, are discussed in Methods of Enzymology, vol. 65, 1980, p. 567. The equivalent procedure that involves the hybridization of small polynucleotides of RNA is described in Hozumi, Tonegawa, PNAS 73 (10) 3628-3632 (1976).

For cases involving the total displacement of short RNA or DNA polynucleotides by long polynucleotides of RNA or DNA, respectively, the reaction is thermodynamically favorable; therefore, the reaction is carried out as described for the respective unfavorable reaction, except that the concentration of the small polynucleotides is not increased. Furthermore, RNA can hybridize to double stranded DNA in the presence of 70% formamide by displacing the identical DNA strand. This reaction probably occurs because of the greater thermodynamic stability of the RNA/DNA hybrid when it is near the denaturation temperature of duplex DNA. Thus, an RNA strand can favorably displace a complementary DNA strand which is all or part of the length of the template. The procedure is described in Thomas, White, Davis, PNAS 73 (7) 2294 (1976). Also, see Methods of Enzymology, vol. 65, 1980, pp. 718-749.

F. 3' to 5' Digest with the Requirement of a 3' OH.

DNA can be digested in the 3' to 5' direction by T₄ exonuclease. T₄ exonuclease degrades single stranded or double stranded DNA completely to 5' mononucleotides and requires a 3' hydroxyl. See Kornberg, DNA Replication, p. 321 and Maniatis, p. 118. The reaction procedure is described in Maniatis, pp. 117-120. The reaction is carried out as described, in the absence of the four dNTPs.

Also, venom exonuclease can serve as the 3' to 5' processive exonuclease that requires a 3' hydroxyl. The enzyme releases 5' mononucleotides and degrades single stranded RNA and DNA completely (see Kornberg, DNA Replication, p. 321 and Dixon, Enzymes, p. 852). The procedure is described in Tu, Wu, Methods of Enzymology, vol. 65, pp. 625-627.

G. 5' to 3' Digest that Requires a 5' OH

Spleen phosphodiesterase requires a 5' hydroxyl terminus, releases 3' mononucleotides, and degrades single stranded RNA and DNA completely (Kornberg, DNA Replication, p. 321 and Dixon, Enzymes, p. 852). The reaction protocol is described by Tu and Wu, Methods of Enzymology, p. 627.

H. Hydrolysis of Single Stranded DNA without Hydrolysis of Double Stranded DNA

S1 nuclease degrades single stranded RNA or DNA to yield 5' phosphoryl mono- or oligonucleotides. Double stranded DNA, double stranded RNA and DNA/RNA hybrids are resistant (Maniatis, p. 140). The procedure followed to selectively degrade single stranded RNA or DNA to leave a 5' phosphate is described in Favaloro, Treisman, Kamen, Methods of Enzymology, vol. 65, 1980, pp. 718-749. In particular, see pp. 729-730.
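The terminus and strandedness requirements stated in sections F through H can be summarized as a small lookup. The following is a minimal sketch, assuming a simplified yes/no model of susceptibility; the `Strand` class, its field names, and `is_degraded` are hypothetical names introduced for illustration and are not part of the described method.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Strand:
    kind: str         # "DNA" or "RNA"
    stranded: str     # "single" or "double" (hybrids count as "double")
    five_prime: str   # "OH" or "phosphate"
    three_prime: str  # "OH" or "H" (dideoxy terminated)


def is_degraded(enzyme, strand):
    """Apply the substrate rules stated in the text for each nuclease."""
    if enzyme == "T4 exonuclease":
        # Single or double stranded DNA; needs a 3' hydroxyl.
        return strand.kind == "DNA" and strand.three_prime == "OH"
    if enzyme == "venom exonuclease":
        # Single stranded RNA or DNA; needs a 3' hydroxyl.
        return strand.stranded == "single" and strand.three_prime == "OH"
    if enzyme == "spleen phosphodiesterase":
        # Single stranded RNA or DNA; needs a 5' hydroxyl.
        return strand.stranded == "single" and strand.five_prime == "OH"
    if enzyme == "S1 nuclease":
        # Single stranded RNA or DNA; duplexes and hybrids are resistant.
        return strand.stranded == "single"
    raise ValueError(f"unknown enzyme: {enzyme}")


# Procedure II after S1 digestion and denaturation: the DNA displacer
# carries a 5' OH, the trimmed extended primer carries a 5' phosphate
# and a dideoxy 3' end.
displacer = Strand("DNA", "single", "OH", "OH")
extended_primer = Strand("DNA", "single", "phosphate", "H")
print(is_degraded("spleen phosphodiesterase", displacer))        # True
print(is_degraded("spleen phosphodiesterase", extended_primer))  # False
```

The same rules reproduce the observation made for Procedure I that dideoxy terminated fragments resist a 3' to 5' exonuclease that requires a 3' OH.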

I. Extension of Primers

DNA or RNA primers may be extended with mass labeled deoxynucleotides and terminated with mass labeled dideoxynucleotides as follows. An RNA or DNA primer is extended on single stranded template with mass labeled deoxynucleotides and terminated randomly with mass labeled dideoxynucleotides. The procedure is described under Primings, Smith, DNA Sequence Analysis by Primed Synthesis, Methods of Enzymology, vol. 65, 1980, pp. 561-580. The procedure is also described in Sanger, Journal of Molecular Biology, 1975, 94, pp. 441-448. Mass labeled deoxy- and dideoxynucleotides are used, and the A, T, C, and G reactions are performed as a single reaction with final reactant concentrations as described in the article by Smith, supra. DNA polymerase I will initiate DNA synthesis at an RNA primer (see Kornberg, DNA Replication, p. 151; Lee Huang, Cavalier, Proceedings of the National Academy of Science, 511, 1022-1028, (1964); Stryer, Biochemistry, p. 576). Mn²⁺ may be substituted for Mg²⁺ if DNA extension at the RNA primer occurs with low efficiency (see Kornberg, p. 151; Smith, Methods of Enzymology, vol. 65, p. 572).

DNA primers are extended in RNA by using mass labeled ribonucleotides; the procedure is described in Methods of Enzymology, vol. 65, pp. 497-499. The RNA extension is terminated randomly by carrying out the reaction in the presence of mass labeled dideoxynucleotides and DNA polymerase I.

J. The Selective Degradation of RNA

RNA can be selectively degraded completely by base hydrolysis. The procedure is described in Methods of Enzymology, vol. 65, p. 572. Also, the RNA could be degraded selectively by exoribonuclease H (RNAase H), which catalyzes the exonucleolytic cleavage of the RNA of RNA/DNA hybrids to 5' phosphomonoesters in the 5' to 3' direction. The procedure is carried out as described in Methods I - Method B, except that the reaction is allowed to go to completion.

K. Separation of Fragments

Separation of polynucleotides by electrophoresis and/or HPLC is described in Methods I and under Electrophoresis Instrumentation.

MATRIX METHOD OF ANALYSIS I

In reference to Methods I, two procedures for producing RNA copies of the single stranded restriction fragments are described. See pages 53-54. Procedure A involves using primed RNA synthesis; Procedure B involves using RNA polymerase under conditions that allow the enzyme to initiate RNA polymerization randomly on a DNA template with high activity. Both methods may give rise to subsets that have a common 5' end. In this case the simple strategy described below is used to solve for the sequence from the data. However, the replication products may not all have a common 5' end; in that case the data must be treated as if subsets are made by the loss of nucleotides randomly from the 3' and 5' ends of the largest polynucleotide, and the matrix method of analysis must be used as described in this discussion.

A single RNA copy of the restriction fragment is desired. The RNA copy's length must be greater than half the length of the restriction fragment, or it must be at least 20 bases in length and be a copy of a portion of the 5' one half of the restriction fragment. The latter condition eliminates the possibility that the copy will bind elsewhere in the template by random probability, and the former condition ensures that enough of the 5' end will be sequenced to complement the sequence determined for the antiparallel strand, so that the entire fragment and the adjoining overlap region with the next restriction fragment are sequenced. If one pure polynucleotide is isolated from either RNA copy procedure, with subsequent steps carried out as described in Method I and composition and terminal base identity data obtained, then the sequence is solved in two steps. First, sequential subsets are taken from the first determination of the composition and terminal base identity described in Methods I, where the change from a polynucleotide to one with the same composition but one nucleotide less in each case assigns the lost nucleotide successively in order from the 3' to the 5' direction. Second, using sequential subsets from the composition and terminal base identity data from the second determination, following the 5' randomization procedure described in Methods I, a gain of a nucleotide from the smallest to the largest polynucleotide assigns those nucleotides successively in order in the 3' to 5' direction.
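The second assignment step, in which each gain of one nucleotide between sequential subsets fixes the next base in the 3' to 5' direction, can be sketched as follows. This is a minimal illustration, assuming compositions are written in the form "4A3G3T4C"; the function names `parse_composition` and `gained_bases` are hypothetical. The example values are the compositions listed for the 5' randomized family of polynucleotide #6 in the data table for FIG. 4.

```python
import re
from collections import Counter


def parse_composition(text):
    """Parse a composition string such as '4A3G3T4C' into base counts."""
    return Counter({base: int(n) for n, base in re.findall(r"(\d+)([ACGTU])", text)})


def gained_bases(compositions):
    """Return the base gained at each step, smallest polynucleotide first.

    Raises ValueError if two neighbours do not differ by exactly one
    nucleotide, i.e. if they are not proper sequential subsets.
    """
    gained = []
    for smaller, larger in zip(compositions, compositions[1:]):
        added = larger - smaller      # keeps only positive differences
        removed = smaller - larger
        if sum(added.values()) != 1 or sum(removed.values()) != 0:
            raise ValueError("not proper sequential subsets")
        gained.append(next(iter(added)))
    return gained


data = ["3A2G2T3C", "3A2G3T3C", "3A2G3T4C", "4A2G3T4C", "4A3G3T4C"]
steps = gained_bases([parse_composition(c) for c in data])
print(steps)                        # ['T', 'C', 'A', 'G'], assigned 3' to 5'
print("".join(reversed(steps)))     # 'GACT', the same bases read 5' to 3'
```

Reading the gained bases in reverse gives the 5' extension GACT, consistent with the polynucleotide GACTACGATGCCTA listed as #6. The first step (loss from the 3' end) follows the same arithmetic with the list ordered from largest to smallest.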

If more than one RNA copy of the restriction fragment is isolated, then the Matrix Method of Analysis must be used, because using the previous procedure on more than one polynucleotide and overlapping the sequences may result in more than one solution due to interchange of assignment between polynucleotides and/or shifting of the overlap region of the polynucleotides. For example, when any given polynucleotides have the same base composition and terminal base, their identity is lost. That is, it is impossible to distinguish which base was gained by which polynucleotide of the next step in the progression.

The matrix used to solve sequences by the Matrix Method of Analysis I is a form which consists of a series of rectangular matrices where the (1,1) positions of all of the matrices are aligned such that they form a "45° line" and all of the matrices at least partially overlap each other. This line is called the "45° initiation line", and all of the polynucleotides which abut this line can serve as the longest designated polynucleotide at position (1,1) to define a rectangular matrix for implementing the matrix method as was previously described. The rules described previously apply to the entire configuration or to any rectangular matrix which is a part of the entire configuration; that is, bases are lost from the 5' end down any column and bases are lost from the 3' end across any row. The first rectangular matrix contains more rows and fewer columns than any other rectangular matrix with a different designated longest polynucleotide at the designated position (1,1) along the 45° initiation line. As described below, what is unique to the designated longest polynucleotide of the first rectangular matrix is that all of the "knowns" are guessed, where the "knowns" make up the nucleotide sequence of the 5' one half of the polynucleotide relative to the "axis", which is in the middle of the polynucleotide. Also, these "knowns" represent the sequence that is complementary to the parent and superimposes more 3' on the parent than any other "knowns" in the configuration. To use the Matrix Method of Analysis I to get a unique solution, the RNA copies (if more than one exists) must overlap each other by at least one nucleotide. This is inferred in the solution. By using the preferred method, polynucleotides which are 5' randomized must be chosen so that their length is greater than that of the polynucleotide represented at the matrix position (1, 1/2m) in the first rectangular matrix generated from the first designated longest polynucleotide as previously described. Also, polynucleotides that are 5' randomized must contain different 3' terminal nucleotides, and the smallest hybrid polynucleotides must be 5' randomized.

##STR7##

Example of the matrix used to solve the sequence if more than one RNA copy of the restriction fragment is isolated. The first rectangular matrix is solved first; that information is then used to move along the 45° initiation line and solve for more of the complement to the parent sequence.

In general, polynucleotides are chosen for randomization so that a unique solution can be obtained when the sequential subsets are guessed to be proper and the "knowns" are guessed, since the sequence 5' to the axis is not known extrinsically. The "knowns" are the nucleotides which represent the sequence 5' to the "axis" of the first designated longest polynucleotide of the first rectangular matrix. A solution of this matrix is attempted by using the data to find from/to moves from coordinate position (1,1) to (1/2M+1, 1/2M) and to/from moves to (1,1) from (1/2M+1, 1/2M) in the absence of any contradictions, as was described previously. If no contradiction arises and the stated rules of randomization are followed, then the sequence is assigned uniquely for the preferred method, except in the case where the smallest polynucleotides 5' randomized contain the same terminal nucleotide. In this case, the ability to find sequential subsets which allow from/to steps down any column, except the last column, in the solved matrix, in the absence of any contradiction, will verify the sequence. In the solved matrix the variables are replaced by the appropriate nucleotides.

Once the 3' end of the sequence, the "unknown" sequence, is solved for in the first rectangular matrix, then this information can be used to assign "known" nucleotides 5' of the new "axis", which is in the middle of the next designated longest polynucleotide along the 45° initiation line. Sequential subsets are guessed as being proper and a solution is attempted. If no set can be found which solves the sequence using this new designated longest polynucleotide, then the next polynucleotide moving along the 45° initiation line is designated the longest polynucleotide and the solution is reattempted. This is repeated until all the polynucleotides along the 45° line are exhausted. In general, the solution process consists of moving along the 45° initiation line from upper right to lower left and finding proper sequential subsets that result in the solution of the rectangular matrix from an initiation point along this line to the coordinate position (1/2M+1, 1/2M), where (1/2M+1,M) is the position assigned relative to the first rectangular matrix of the configuration matrix, and back to position (1,1) without giving rise to a contradiction. Movement along the 45° initiation line will solve for more of the sequence in the 3' direction of the complementary sequence of the parent.

In the case of more than one RNA copy, separation by size may yield mass data of mixtures of polynucleotides. However, this data can be solved to yield the composition and terminal nucleotide of the component polynucleotides by solving multiple equations in multiple unknowns. Consider an example where 3 RNA copies are isolated, extended in DNA and separated by size, where each band contains 3 polynucleotides of different composition but the polynucleotides in each band are the same length. If one of the bands which contains 3 polynucleotides is 5' randomized, then each band resulting from the separation by size of this population may contain 3 polynucleotides, except for the last 3 bands (which will contain 3, 2, and 1 polynucleotides, respectively). The mass data yields 8 equations: 4 from the signal due to A, T, G and C in any polynucleotide and 4 due to A, T, G and C terminal bases. If a band contains 3 polynucleotides, then the relative amount of each polynucleotide is unknown; this represents three unknowns. The last band contains only one base, the terminal. For the next to the last band 2 polynucleotides exist, one in proportion x and the other in proportion y, where x and y are variables. The one polynucleotide contains the same terminal but gains a 5' nucleotide, N; the second contains the same base as present in band 1 in the 5' position, but gains a terminal, T. Therefore, there are eight equations and four unknowns. The same reasoning applies to the three polynucleotides containing 3 bases. In successive bands each polynucleotide gains a nucleotide 5' to the existing nucleotides, and these 3 bases represent 3 unknowns and the relative proportions of each polynucleotide represent 3 additional unknowns; therefore, there are 8 equations and 6 unknowns. At the limit, the 8 equations can be solved simultaneously to produce the composition and terminal base data for 4 polynucleotides in each band. This represents a capability which will not be exceeded during implementation of the preferred method.
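The counting argument above can be made concrete with a forward model of the eight signals. The following is a minimal sketch, assuming each signal is simply the proportion-weighted sum over the polynucleotides in a band; the function names `band_signals` and `unknown_count`, and the example mixture, are hypothetical and chosen only for illustration.

```python
from fractions import Fraction

BASES = "ATGC"


def band_signals(mixture):
    """Return (composition, terminal) signals for one band.

    mixture is a list of (sequence, proportion) pairs. The 4 composition
    signals and 4 terminal signals are the 8 equations of the text.
    """
    composition = {base: Fraction(0) for base in BASES}
    terminal = {base: Fraction(0) for base in BASES}
    for sequence, proportion in mixture:
        for base in sequence:
            composition[base] += proportion
        terminal[sequence[-1]] += proportion  # 3' terminal base
    return composition, terminal


def unknown_count(n_polynucleotides):
    """One gained base plus one relative proportion per polynucleotide."""
    return 2 * n_polynucleotides


# A two-component band: one polynucleotide gained a 5' base,
# the other gained a new terminal base.
comp, term = band_signals([("GA", Fraction(1, 3)), ("AT", Fraction(2, 3))])
print(comp, term)
print([unknown_count(k) for k in (2, 3, 4)])  # 4, 6 and 8 unknowns
```

With 8 equations available, the unknown count of 2 per polynucleotide reaches 8 at four polynucleotides per band, which matches the stated limit.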

FIG. 4 shows the matrix used to solve the sequence G A C T A C G A T G C C T A G T G C T.

The following fragments shown in FIG. 4 are 5' randomized: #6, #15, #18, #26. The data from this procedure is shown below.

    ____________________________________________________________
    Data
          Polynucleotide
          (RNA underlined)    Composition    Δ    Terminal
    ____________________________________________________________
          #6
    6     GACTACGATGCCTA      4A3G3T4C       G    A
    27     ACTACGATGCCTA      4A2G3T4C       A    A
    28      CTACGATGCCTA      3A2G3T4C       C    A
    29       TACGATGCCTA      3A2G3T3C       T    A
    30        ACGATGCCTA      3A2G2T3C            A
          #15
    15             GACTA      2A1G1T1C       G    A
    31              ACTA      2A1T1C         A    A
    32               CTA      1A1T1C         C    A
    33                TA      1T1A                A
          #18
    18    TACGATGCCTAGTG      3A4G4T3C       T    G
    34     ACGATGCCTAGTG      3A4G3T3C       A    G
    35      CGATGCCTAGTG      2A4G3T3C       C    G
    36       GATGCCTAGTG      2A4G3T2C       G    G
    37        ATGCCTAGTG      2A3G3T2C       A    G
    38         TGCCTAGTG      1A3G3T2C            G
          #26
    26            TACGAT      2A1G2T1C       T    T
    39             ACGAT      2A1G1T1C       A    T
    40              CGAT      1A1G1T1C       C    T
    41               GAT      1A1G1T         G    T
    42                AT      1A1T                T
    ____________________________________________________________
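The Δ column of the data table can be recovered from the composition column alone, since each member of a 5' randomized family differs from the next smaller member by one base. The following is a minimal sketch of that bookkeeping, not the matrix method itself; the function names `parse_composition` and `five_prime_losses` are hypothetical and are introduced only for illustration.

```python
import re
from collections import Counter

def parse_composition(text):
    """Parse a composition string such as '4A3G3T4C' into a Counter of bases."""
    return Counter({base: int(n) for n, base in re.findall(r"(\d+)([ACGT])", text)})

def five_prime_losses(compositions):
    """Given compositions ordered longest to shortest, return the base lost at each step."""
    parsed = [parse_composition(c) for c in compositions]
    losses = []
    for longer, shorter in zip(parsed, parsed[1:]):
        lost = longer - shorter      # Counter subtraction keeps only positive counts
        gained = shorter - longer    # must be empty for a pure one-base loss
        if sum(lost.values()) != 1 or gained:
            raise ValueError("successive members must differ by exactly one base")
        losses.append(next(iter(lost)))
    return losses

# Composition column of family #6, copied from the data table.
family_6 = ["4A3G3T4C", "4A2G3T4C", "3A2G3T4C", "3A2G3T3C", "3A2G2T3C"]
print("".join(five_prime_losses(family_6)))
```

For family #6 the recovered bases agree with the Δ column and with the first four 5' bases of polynucleotide #6.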

The connected coordinate positions in the lattices shown in FIGS. 5 and 6 represent all possible from/to moves consistent with the change in composition data. The heavy lines represent the to/from moves consistent with the terminal base and change in terminal base data as verified in the matrix.

MATRIX METHOD OF ANALYSIS II

For families of molecules generated by the second approach to overlapping sequences, the 5' portion of the molecule is known extrinsically. The reason is that the 5' portion of the sequence of any family of polynucleotides corresponds to the 3' portion of the family of polynucleotides generated from the adjoining segment of DNA. This is the region of overlap. Thus, when the entire sequence for any polynucleotide family is solved, the 5' end of the downstream set of polynucleotides is solved concomitantly. Thus, the letters K₁, K₂, K₃, K₄ . . . represent the "known" sequence and X₁, X₂, X₃, X₄ . . . represent the "unknown" portion of the sequence to be solved. This means that the sequence of any DNA strand can be solved sequentially by moving along the strand in the 5' to 3' direction. As the sequence of one polynucleotide family is deciphered, the 5' portion of the adjoining family is obtained and so on. This, of course, does not apply to the first polynucleotide family sequenced.

However, there are several strategies for determining the sequence of the initial polynucleotide family. (See Sequential Subsets Method). Furthermore, it was demonstrated in the Examples of Solving Sequences by the Matrix Method of Analysis that the set of polynucleotides generated for analysis could contain the ##STR8## preceding "known" in the 5' direction only if it contained the succeeding "knowns". For example, K₄ could only be present if K₃-K₁ were also present in any polynucleotide. Also it was discussed under Strategy I that any succeeding "unknown" can be present only if all the preceding "unknowns" were present in the polynucleotide up to the "axis" dividing "knowns" from "unknowns". This second condition must be rigidly adhered to; however, the Matrix Method of Analysis can be used to solve uniquely a set of polynucleotides which do not conform rigorously to the first condition; that is, all possible polynucleotides of the "knowns" may be present in the set analyzed.

The solution follows the same procedure as before, except that the polynucleotides of the "known" type which contain base pairs lost in the 3' to 5' direction relative to the "axis" are recorded in the product matrix at the proper position.

The "known" type refers to polynucleotides in the matrix which do not contain variables. Those polynucleotides are recorded in the lattice at the level and position appropriate for the corresponding matrix coordinate position.

An exemplary matrix is shown below for polynucleotides which conform to the criteria set forth for strategy II. For a designated longest polynucleotide which contains a total of eight (8) nucleotides, the matrix is made of 5 rows and 8 columns. ##STR9## An exemplary lattice is shown below for polynucleotides which conform to the criteria set forth for strategy II. The matrix coordinate positions corresponding to polynucleotides of the "known" type are indicated.

    ______________________________________
     ##STR10##
                      Matrix
    Polynucleotide    Coordinate Position
    ______________________________________
    K₄ K₃ K₂ K₁       15
    K₄ K₃ K₂          16
    K₃ K₂ K₁          25
    K₄ K₃             17
    K₃ K₂             26
    K₂ K₁             35
    K₄                18
    K₃                27
    K₂                36
    K₁                45
    ______________________________________
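The ten coordinate positions listed above follow a regular pattern if each two-digit entry is read as a row digit followed by a column digit. Under that reading, which is an inference from the listed entries and is not stated in the text, a "known" polynucleotide running from K_hi down to K_lo sits at row 5 − hi and column 4 + lo. The sketch below simply enumerates that pattern for four "knowns"; the function name `known_coordinate` and the parameter `n_known` are hypothetical.

```python
def known_coordinate(hi, lo, n_known=4):
    """Return (row, column) for the 'known' polynucleotide K_hi ... K_lo.

    Assumes each listed coordinate is a row digit followed by a column digit,
    an interpretation inferred from the table rather than stated in the text.
    """
    if not (1 <= lo <= hi <= n_known):
        raise ValueError("expected 1 <= lo <= hi <= n_known")
    return (n_known + 1 - hi, n_known + lo)

# Enumerate every contiguous run of "knowns" for a four-"known" matrix.
for hi in range(4, 0, -1):
    for lo in range(1, hi + 1):
        row, col = known_coordinate(hi, lo)
        label = " ".join(f"K{i}" for i in range(hi, lo - 1, -1))
        print(f"{label:<12} {row}{col}")
```

The enumeration produces the same ten polynucleotide and coordinate pairs as the table, in a different order.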

Furthermore, up to this point it was assumed that there is a "known" half and an "unknown" half to the DNA polynucleotide which is being sequenced in the (n+1)th reaction, and it is a portion of the overall piece being sequenced. One approach to a solution is to choose the correct "known" part and to set up the corresponding matrix and lattice. Choose one of the polynucleotides from the data which is largest. Guess that half are "knowns" and half are "unknowns" and guess that the axis between the two is at the end of the "knowns" determined in the preceding reactions. Based on this guess, a matrix is set up and a solution is attempted using the sequential subsets of the chosen polynucleotide. If proper sequential subsets are not guessed initially, then other guesses are tried until a suitable set is determined. If an inconsistency is encountered for every set of sequential subsets, then the axis is shifted one nucleotide in the 5' direction and this is repeated up to a number of times equal to 1/2 the number of nucleotides in the polynucleotide selected until a solution is found. If this is unsuccessful, a polynucleotide one nucleotide less than the selected one or another polynucleotide of the same length is chosen and the process is repeated. Reiteration of this overall scheme is continued until an unambiguous solution is obtained. From any given data set only one solution is possible.

The following diagram demonstrates the 5' shift of the axis. ##STR11##

NUMBER OF POLYNUCLEOTIDES NECESSARY TO SOLVE FOR A GIVEN NUMBER OF "UNKNOWN" NUCLEOTIDES

For the polynucleotide generation method, Procedure I, the products conform to the criteria of Strategy I. The 5' randomization of the 3' random ended polynucleotides in reaction n+1 is carried out by the partial displacement of the 5' end of the polynucleotides from reaction n+1 with the 3' random ended polynucleotides from reaction n, followed by the subsequent destruction of single stranded nucleic acid. The "axis" is at the position that the 3' end of the longest polynucleotide transferred from reaction n would superimpose on the template. Maximum displacement of the 5' end of any polynucleotide in reaction n+1 is due to this polynucleotide; therefore, 3' to this point relative to the complement of the template, the condition X_(n) -X₁ is assured. The total number of final polynucleotides to be scanned is (1/4)x², where x is a variable. (1/2)x "unknown" nucleotides are solved from a set of sequential subsets containing x-1 elements. The solution is obtained by using the method described in the Matrix Method of Analysis II where the form of the matrix is as appears in examples 1 and 2 of Solving Sequences by Matrix Method of Analysis, pages 35 and 36. The ratio of the number of nucleotides solved to the total number of fragments for a given reaction is: ##EQU1##
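The equation itself is not reproduced in this text. As a hedged restatement only, the two counts given in the paragraph ((1/2)x "unknowns" solved, (1/4)x² polynucleotides scanned) imply the following ratio; this is arithmetic on the stated quantities and not a reconstruction of the original figure.

```latex
\[
\frac{\text{nucleotides solved}}{\text{polynucleotides scanned}}
  = \frac{\tfrac{1}{2}\,x}{\tfrac{1}{4}\,x^{2}}
  = \frac{2}{x}
\]
```

For instance, x = 20 would correspond to 10 "unknowns" solved from 100 scanned polynucleotides, a ratio of 1/10.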

In methods of polynucleotide generation involving the extension of primers, Procedures II-V, consider the case where every primer is extended in a random fashion so that every primer contains every successive extended polynucleotide containing one more nucleotide than the preceding polynucleotide up to an extension of n nucleotides. As described previously, if a given primer were superimposed on the template, then the axis relative to the template is at the end of the given primer. Since the primers are extended 5' to 3', it is impossible to initiate transcription 3' to the "axis"; therefore, the condition X_(n) -X₁ is assured. Furthermore, all of the primers are random on the 3' end, and if the longest and shortest differ by y nucleotides, where y is a variable, then by extending each primer randomly up to x nucleotides a total of xy different 3' and 5' random polynucleotides exist following elimination of the primers. The solution is obtained by using a configuration matrix and method of moving along the 45° initiation line as is described in Matrix Method of Analysis I. The configuration matrix in this case contains multiple rectangular matrices of the type described in Matrix Method of Analysis II, and the lattices used to solve these matrices are of the form described in that section. Also, the "knowns" of the designated longest polynucleotide of the first rectangular matrix are assigned from the solution of the "unknown" part of the sequence from the preceding reaction. The first rectangular matrix is solved as are other succeeding rectangular matrices using the method described in Matrix Method of Analysis II. In general, the method consists of moving along the 45° initiation line where the "unknowns" solved for in a preceding rectangular matrix are used as the "knowns" in a succeeding rectangular matrix to solve for more of the sequence of the complement of the parent in the 3' direction. Then the solution of the sequence of these designated longest polynucleotides can be used collectively as knowns for a longer first designated longest polynucleotide in a reiteration of the procedure. This occurs when the first designated longest polynucleotide whose sequence is determined by the procedure of Matrix Method of Analysis II is of length less than the longest polynucleotide isolated. Sequential subsets of x-1 polynucleotides are chosen to solve any given rectangular matrix in which the longest designated polynucleotide contains x nucleotides and at least 2 sets are necessary to solve for a total of x "unknowns" in a given configuration matrix. The ratio of the number of "unknowns" solved to the total number of polynucleotides generated is x/xy=1/y.

For the preferred method, since the 3' random hybrid molecules are 5' randomized by a procedure which only removes RNA, the most 3' junction of RNA and DNA relative to the complement of the parent represents the "axis". Furthermore, if one RNA copy is isolated when following the procedure described in Methods I, then each of the polynucleotides generated from subsequent reactions and scanned yields the solution of one nucleotide. Thus, the ratio of the number of "unknowns" solved to the total number of polynucleotides generated is one. If more than one RNA copy is isolated, then the ratio is x/xR=1/R, where R is the number of RNA copies isolated.

METHOD OF SEQUENTIAL SUBSETS

The method of sequential subsets is based on arranging the polynucleotides in a hierarchical fashion such that any given polynucleotide of the set contains the exact same composition as the next smaller polynucleotide, except that it contains one extra nucleotide. All of these polynucleotides are generated by replicating a template in the 5' to 3' direction from the same initiating point; thus, the order of addition of the nucleotides progressing through the set assigns the sequence of the replicated polynucleotide in the same order from 5' to 3'.

EXAMPLE OF THE METHOD OF SEQUENTIAL SUBSETS

Resolution of 3 polynucleotides of 3 nucleotides, all containing 1-A; 2-T, into the Sequences ATT; TTA; TAT.

    ______________________________________                                        Fragment #   Data 1     Data 2    Data 3                                      ______________________________________                                        1            2-T 1-A    2-T 1-A   2-T 1-A                                     2            1-T 1-A    2-T       1-T 1-A                                     3            1-A        1-T       1-T                                         Reading from the bottom to the top of a column,                               the sequence is:                                                                         ATT      TTA       TAT                                             ______________________________________                                    
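The reading rule in the table (bottom of a column to the top) can be sketched in a few lines: starting from the smallest fragment, the base gained at each step is appended to the growing sequence. This is an illustrative sketch under the assumption that every fragment in a column differs from the next smaller one by exactly one nucleotide; the function name `read_sequence` is hypothetical.

```python
from collections import Counter

def read_sequence(subsets):
    """Read a sequence 5' to 3' from compositions ordered smallest to largest."""
    sequence = []
    previous = Counter()
    for composition in subsets:
        current = Counter(composition)
        gained = current - previous          # bases present now but not before
        if sum(gained.values()) != 1 or (previous - current):
            raise ValueError("each subset must add exactly one nucleotide")
        sequence.append(next(iter(gained)))
        previous = current
    return "".join(sequence)

# Columns of the table, listed from fragment 3 (smallest) up to fragment 1.
data_1 = [{"A": 1}, {"T": 1, "A": 1}, {"T": 2, "A": 1}]
data_2 = [{"T": 1}, {"T": 2}, {"T": 2, "A": 1}]
data_3 = [{"T": 1}, {"T": 1, "A": 1}, {"T": 2, "A": 1}]
print(read_sequence(data_1), read_sequence(data_2), read_sequence(data_3))  # ATT TTA TAT
```

The three columns share the same largest composition (2-T 1-A) and are distinguished only by the order in which the bases are gained, which is the point of the example.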

DETERMINATION OF THE BASE COMPOSITION AND TERMINAL BASE IDENTITY OF A POLYNUCLEOTIDE

The base composition and the identity of the terminal base of polynucleotides can be determined by methods such as chromatography, mass spectrometry, and NMR. For example, isolated polynucleotides can be hydrolyzed to mononucleotide subunits. Each ribonucleotide, deoxyribonucleotide, and dideoxyribonucleotide can be identified by its unique migration time, for example, using techniques such as HPLC, ion exchange chromatography, thin layer chromatography, etc. The relative amounts of each base can be quantified by absorbance, fluorescence, or by scintillation quantification if the nucleotides or bases are radiolabeled. Only a single dideoxynucleotide is present in each polynucleotide; therefore, it serves as the internal standard. All intensity values of signals corresponding to the different bases are normalized by the signal corresponding to the dideoxynucleotide to determine the base composition. Depending on the monitoring technique the intensity values may have to be corrected by a calibration factor before being normalized.
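The normalization step described above can be sketched as follows. The sketch assumes that each calibration factor multiplies the raw intensity and that the corrected ratio is rounded to the nearest integer; the function name `base_composition`, the label "ddA", and the numerical intensities are all invented for illustration and are not data from this disclosure.

```python
def base_composition(intensities, terminal, calibration=None):
    """Convert raw signal intensities into integer counts of non-terminal bases.

    intensities: mapping of base label -> raw signal for the non-terminal bases
    terminal:    (label, raw signal) for the single dideoxynucleotide
    calibration: optional mapping of label -> factor multiplying the raw signal
    """
    calibration = calibration or {}
    term_label, term_signal = terminal
    # The dideoxynucleotide occurs once, so its corrected signal is one unit.
    unit = term_signal * calibration.get(term_label, 1.0)
    if unit <= 0:
        raise ValueError("terminal signal must be positive")
    counts = {}
    for base, signal in intensities.items():
        corrected = signal * calibration.get(base, 1.0)
        counts[base] = round(corrected / unit)
    return counts, term_label

# Hypothetical intensities in arbitrary units.
counts, terminal_base = base_composition(
    {"A": 10.3, "G": 9.8, "T": 10.1, "C": 19.7}, ("ddA", 10.0)
)
print(counts, terminal_base)
```

Rounding to the nearest integer is a design choice of this sketch; it is only meaningful when the quantitative error of the measurement is small compared with one unit of signal.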

DETERMINATION OF THE BASE COMPOSITION AND TERMINAL BASE IDENTITY BY MASS SPECTROMETRY

Mass spectrometry can be used to determine the base composition and terminal base. For example, to calibrate the mass spectrometer, spectra of known mixtures of the possible nucleotides or nucleosides are obtained and the intensity as a function of concentration for each nucleotide or nucleoside is determined from the intensities of the corresponding peaks of the spectra. When an unknown polynucleotide is fragmented by the electron beam or is prehydrolyzed to nucleotides or nucleosides and a mass spectrum is obtained, the known correspondence between a peak or peaks of the spectrum and a given base serves to make base assignments. To determine the relative number of bases in the polynucleotide, the intensity of the peak or peaks corresponding to each base is corrected by the calibration factor. Then the base composition is determined by normalizing the corrected intensities with that corresponding to the dideoxyterminal nucleotide which is present only once in the polynucleotide; thus, it serves as internal standard. Calibration of a mass spectrometer is described by Hawley et al., Nucleic Acids Research, Vol. 5, Number 12, December 1978, pp. 4949-4956.

Mass spectroscopy of free bases is also used to determine the base composition and terminal base by the same procedure as is used for the nucleotides and nucleosides, with the exception that base analogues are used as the dideoxynucleotides. Exemplary base analogues are given in the Determination of the Base Composition and Terminal Base Identity by Thin Layer Chromatography Section.

DETERMINATION OF THE BASE COMPOSITION AND TERMINAL BASE IDENTITY BY NUCLEAR MAGNETIC RESONANCE LABELING AND SPECTROSCOPY

The base composition and terminal base identity can be determined via nuclear magnetic resonance spectroscopy (NMR). For this purpose, the polynucleotides are constructed with nucleotides that are labeled with atoms that produce a signal detectable by NMR. The signal corresponding to each possible base and terminal base is distinctive. The base composition and terminal base identity can be determined from the known correspondence between the chemical shift of a given peak and given base and from the normalized intensity of each peak. Normalization is executed using the intensity of the signal corresponding to the dideoxyterminal nucleotide which is singularly present in each polynucleotide; thus, it serves as an internal standard.

SUMMARY OF THE NMR MODE

The base composition and terminal nucleotide of a polynucleotide can be determined by NMR Spectroscopy using selectively NMR-labeled nucleotides.

An atom in each deoxynucleotide or ribonucleotide, U, A, T, G and C, and a different atom for each dideoxynucleotide are chosen, where the signals from these atoms occur at chemical shifts which are unique from each other. Since each polynucleotide contains one dideoxynucleotide, it serves as the internal standard of signal intensity of one atom, one nucleotide. Therefore, the identity of the nucleotides present in a polynucleotide is determined from the chemical shift of the signal and the number of such nucleotides in a polynucleotide is determined by normalizing the signal intensity relative to the dideoxy internal standard. An example of NMR data for polynucleotide #15 of FIG. 4 is shown in FIG. 7A.

Nucleotides appropriately labeled for this procedure can be synthesized from properly labeled precursors. The proper atoms for labeling can be determined from an NMR spectrum of the nucleotide. A spectrum of the labeled nucleotides will have peaks present at the same chemical shift as that for the unlabeled spectrum with the absence of the peaks from nonlabeled atoms.

Two atoms useful for NMR-labels are C¹³ and H¹. C¹³-labeled nucleotides can be synthesized so that a single C¹³ atom is substituted for a C¹² atom at a designated position. A different position must be chosen for each nucleotide and for each dideoxynucleotide. Based on the total number of carbon atoms in each nucleotide, the number of replaceable carbon atoms in each of the purine and pyrimidine bases are as follows:

    ______________________________________                                                cytosine                                                                             3                                                                      thymine                                                                              3                                                                      adenine                                                                              5                                                                      guanine                                                                              5                                                              ______________________________________                                    

If hydrogen is the isotope detected, nucleotides are synthesized so that all exchangeable and nonlabeled hydrogens are replaced by deuterium atoms. Deuterium produces no signal in the H¹ spectrum. Thus, the non-replaced, non-exchangeable hydrogens serve as the signal-generating species. For each nucleotide a hydrogen is selected which will yield a distinctive chemical shift. The number of non-exchangeable hydrogens on purine and pyrimidine are as follows:

    ______________________________________
    cytosine         2
    thymine          1 and 3 on CH₃ group
    adenine          2
    guanine          1
    ______________________________________

Because a distinct label is needed for each nucleotide when it is internal and when it is terminating, for guanine, a hydrogen on the ribose can serve as label to satisfy this requirement.

Multiple samples of NMR-labeled polynucleotides can be scanned simultaneously using a field gradient. For deuterium containing nucleotides in a D₂O solvent, the field strength ranges from 5 to 12 Tesla with resonance frequencies from 215 to 516 Megahertz. Therefore, with a chemical shift within any given sample of 10 parts per million in the presence of a field gradient ranging from 5 to 12 Tesla, 8×10⁴ samples could be scanned simultaneously; a Fourier transform would produce the frequency spectrum from the equation F(w)=∫f(t)e^(-jwt) dt. The Larmor equation, w=γH, where the field gradient as a function of the spatial coordinates is known, can be used to assign the frequency spectrum to the individual samples, and, thus, the composition and terminal nucleotide assignment for the individual samples is assigned.

For C¹³ NMR, H₂O can be used as a solvent provided that protons are decoupled by irradiating at their resonance frequency. In using a field gradient, D₂O solvents are unnecessary for a field strength less than 8 Tesla. Over the range of 2 to 8 Tesla, with corresponding resonances at 21.4 to 85.6 Megahertz and a chemical shift of 100 parts per million for C¹³, 1.5×10⁴ samples can be scanned simultaneously; if D₂O is used, then higher fields can be used and, therefore, more samples can be handled simultaneously. Thus, by using a field gradient in the case of H¹, 80,000 samples can be scanned per second vs. one per second if, for example, a mechanical belt were used to transport samples as they are scanned one at a time. Furthermore, NMR is accurate quantitatively within 1%; therefore, samples containing as many as 100 of a given nucleotide per strand can be analyzed and the largest polynucleotide from any given reaction can contain as many as 400 nucleotides.

DETAILED DESCRIPTION OF THE NMR MODE

NMR labeled nucleotides can be synthesized chemically by starting with the properly labeled precursors and synthesizing the nucleotides. Methods for the synthesis of nucleotides are described in the following references:

Yamazek, Okotso; Nucleic Acids Research, vol. 3, no. 1, 1976, pp. 251-258 (guanine).

Hilbert, Jansen; Journal of the American Chemical Society, vol. 57, 1955, pp. 552-554 (cytosine).

Scherp; Journal of the American Chemical Society, May 1946, pp. 912-913 (thymine).

Richter, Loeffler; Journal of the American Chemical Society, vol. 82, June 1960, pp. 3144-3145 (adenine).

Berichte der Deutschen Chemischen Gesellschaft, Jan-April, 1900, p. 1370.

Berichte der Deutschen Chemischen Gesellschaft, Band 2, 1897, p. 22; pp. 30-60.

The appropriate atoms for labeling can be determined from the NMR spectra of nucleotides. C¹³ spectral data is set forth in Dorman, Roberts; Proceedings of the National Academy of Sciences, vol. 65, No. 1, Jan 1970, pp. 19-26; Jones, Winkley; Proceedings of the National Academy of Sciences, vol. 65, No. 1, Jan 1970, pp. 27-30; Mantsch, Smith; Biochemical and Biophysical Research Communications, vol. 46, No. 2, 1972; Topics in Carbon 13 NMR Spectroscopy, George Levy, John Wiley and Sons, New York, 1976.

C¹³ NMR Studies of Biopolymers, pp. 244-248. Structural andStereochemical Applications, pp. 469-478.

CRC Handbook of Chemistry and Physics E 71-E 75.

H¹ NMR spectral references are as follows:

CRC Handbook of Chemistry and Physics E 71-E 75. Bubienko, Uniack; Biochemistry, 1981, 20, 6987-6994. Cheng, Kan, Leutzinger; Biochemistry, 1982, 21, 621-630.

C¹³ and H¹ are two candidates for NMR labels where the specific signal due to the labeled atom is used to quantitate the number of these atoms present in a sample and therefore the number of molecules of the specific nucleotide in a hydrolyzed or nonhydrolyzed, purified sample. In the case of C¹³ the nucleotides would be synthesized with a single C¹³ atom replacing a C¹² atom in the designated position in each of the nucleotides and at a different atom for the terminating nucleotides. On the purines and pyrimidines alone the number of replaceable carbons is as follows: cytosine (3), thymine (3), adenine (5), guanine (5). Where hydrogen is the NMR label, nucleotides would be synthesized so that all exchangeable and nonlabeled hydrogens would be replaced by deuterium, which gives no signal in the H¹ spectrum; the hydrogen which is not replaced by deuterium is the label. For each nucleotide a hydrogen is selected which has a distinct chemical shift and a different hydrogen is selected for each nucleotide to serve as the label for the terminating nucleotides. The number of nonexchangeable hydrogens on the purine and pyrimidine molecules is as follows: cytosine (2), thymine (1+3 on CH₃), adenine (2), guanine (1). For guanine, a hydrogen on the ribose is labeled because the base only contains one hydrogen and two distinct hydrogens are necessary to label both the deoxy and dideoxynucleotide.

The cost of synthesizing the nucleotides in either case is not important because in both cases the nucleotides are reusable. With respect to noise C¹³ has an advantage in that any contaminating nucleotides that are C¹² will not give an NMR signal. However, the natural abundance of C¹³ is 1.1%, whereas the natural abundance of deuterium is 0.015% and the natural abundance of H¹ is 99.98%; therefore, there is more background noise when C¹³ labeled nucleotides are used. Furthermore, the signal is much greater for hydrogen because its gyromagnetic ratio is much larger (42.6 Megahertz per Tesla for H¹; 10.7 Megahertz per Tesla for C¹³). C¹³ has a relative signal of 1.59×10⁻² compared to H¹.
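The quoted relative signal is consistent with the standard rule that, at equal field and for equal numbers of spin-1/2 nuclei, NMR sensitivity scales as the cube of the gyromagnetic ratio. The short check below uses only the two ratios quoted in the text; the function name `relative_sensitivity` is hypothetical.

```python
GAMMA_H1 = 42.6   # gyromagnetic ratio quoted in the text (MHz per Tesla)
GAMMA_C13 = 10.7  # gyromagnetic ratio quoted in the text (MHz per Tesla)

def relative_sensitivity(gamma, gamma_ref):
    """Relative signal at equal field and equal number of spin-1/2 nuclei."""
    return (gamma / gamma_ref) ** 3

ratio = relative_sensitivity(GAMMA_C13, GAMMA_H1)
print(f"{ratio:.2e}")
```

With the rounded ratios above the result falls within about one percent of the quoted 1.59×10⁻²; the small difference comes from rounding the gyromagnetic ratios to three significant figures.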

To achieve a strong signal, 50 µmoles total of C¹³ vs. 500 nmoles total of H¹ are necessary. 1 ml solutions in which the total H¹ concentration is millimolar give rise to a good signal, whereas for C¹³ the total concentration in a 1 ml solution must be at least 50 mM.

For H¹, exchangeable hydrogens are replaced by deuterium by evaporating and resolvating the sample in D₂O. The NMR sample is less than 1 ml in volume, and the cost of D₂O is $10 for 100 g; therefore, the cost is minor and the D₂O is recoverable. Also, many NMRs use D₂O as a signal lock. The time for scanning H¹ is less than 1 second whereas, due to the longer relaxation time, the scanning time of C¹³ is on the order of seconds to minutes depending on the relaxation mechanisms available to the carbon nucleus. Also, pulse patterns are available to shorten the scan time for C¹³ but some are not quantitative.

The equipment used to determine the nucleotide composition and terminal nucleotide can consist of an NMR spectrometer such as the Bruker WM-250, Varian HR-220, Nicolet NT-360 or a Bruker WM-500. Samples can be moved mechanically in and out of the NMR scanner with a belt apparatus, for example. The NMR instrument can be designed to implement a field gradient and a Fourier transform algorithm to scan many samples at once. Technology of using a field gradient to assign NMR signal intensities to spatial coordinates in a line or in a plane is described in U.S. Pat. Nos. 4,021,726 and 4,115,730 respectively. NMR signal intensities can be assigned in one dimension. Thus, a field gradient need be applied along one axis only. The samples are aligned along this axis and excited with rf signals and the free induction decay signals are read.

Many samples can be scanned at once using a field gradient. For H¹ using a D₂O solvent, conventional field strengths range between 5 and 12 Tesla with resonating frequencies between 215 and 516 Megahertz. Therefore, with a chemical shift within any given sample of 10 parts per million in the presence of a field gradient ranging from 5 to 12 Tesla, 8×10⁴ samples could be scanned simultaneously. A Fourier transform would produce the frequency spectrum from the equation: F(w)=∫f(t)e^(-jwt) dt. The Larmor equation, w=γH, where the field gradient as a function of the spatial coordinates is known, can be used to assign the frequency spectrum to the individual samples. Therefore, the species in each sample can be quantified as described previously.

The samples could be arranged either in a line or in two dimensions. In the former case, a single gradient is applied in the direction of the samples, the samples are excited with rf signals, and the free induction decay signals are measured. From this data, intensities of the nucleotides of the individual samples can be obtained as described previously. This method is used in the reconstruction of a line in a plane of a body section and is described in U.S. Pat. No. 4,021,726. This patent discloses a means and method where an additional gradient is necessary to select a plane perpendicular to the gradient along a line in that plane. This method and means are not necessary in the present invention.

If the samples are arranged in two dimensions, then the same methods implemented in the above patent to reconstruct a cross sectional plane of the human body can be used to assign the intensity spectrum of the different nucleotides in the individual samples. For example, readout signals which relate to a single line could be obtained as described previously. And, to obtain information for the complete plane, cycle operations are necessary in which the sample is scanned successively along a sequence of lines. Another method is to apply an orthogonal gradient and simultaneously apply selective rf pulses to select strips in the plane and then apply orthogonal field gradients to the samples of such relative magnitudes that each point of the selected strip is subjected to a resultant magnetic field at an amplitude unique to that point. The free induction decay signal is then read out from the strips and the intensity of the signals is assigned spatially. Since the positions of the individual samples are known, the intensity spectrum of the nucleotides in the samples is assigned. A modified apparatus and method as described in U.S. Pat. No. 4,115,730 may be used with the exception that the means and method for isolating a plane do not apply. Additionally, an apparatus and method of implementing the so-called "echo-planar" method of assigning NMR signal intensities spatially may be used. In this method, as it pertains to the present invention, resonance is established in the plane and then two orthogonal gradients provide dispersion in the plane. One gradient is pulsed to space the frequencies of spins in adjacent strips allowing the other gradient to provide dispersion down the strips. The FID signals for each distinct sample so produced can be put in an array and two-dimensionally Fourier transformed to yield the signal intensity spectrum for each sample. The method and apparatus of the present invention is described in U.S. Pat. No. 4,335,282 except that the method and apparatus for selecting the plane does not apply.

For C¹³ NMR, H₂O can be used as a solvent provided that protons are decoupled by irradiating at their resonance frequency. When a field gradient is used, D₂O solvents are unnecessary for a field strength less than 8 Tesla; over the range of 2 to 8 Tesla, with corresponding resonances at 21.4 to 85.6 megahertz and a chemical shift of 100 parts per million for C¹³, 1.5×10⁴ samples can be scanned simultaneously. If D₂O is used, then higher fields can be applied and, therefore, more samples can be handled simultaneously. Thus, it can be seen that by using a field gradient in the case of H¹, 80,000 samples can be scanned per second vs. 1 per second if, for example, a mechanical belt were used to transport samples as they are scanned one at a time. Furthermore, NMR is quantitatively accurate to within 1%; therefore, samples containing as many as 100 of a given nucleotide per strand can be analyzed, and the largest fragment from any given reaction can contain as many as 400 nucleotides.
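The sample counts quoted above for H¹ and C¹³ can be checked to order of magnitude with a short calculation. The following Python sketch is illustrative only and rests on an assumption not stated in the text: each sample is taken to occupy a frequency band equal to the chemical shift range (in parts per million) of its own resonance frequency, and the bands are packed edge to edge across the gradient. The function name `max_samples` is hypothetical.

```python
import math

def max_samples(f_min_mhz, f_max_mhz, shift_ppm):
    """Estimate how many non-overlapping sample spectra fit between two
    Larmor frequencies.

    Assumption (not from the text): each sample occupies a band whose width
    is shift_ppm of its own resonance frequency, so successive sample
    frequencies form a geometric series f, f(1+d), f(1+d)^2, ...
    """
    delta = shift_ppm * 1e-6
    return math.log(f_max_mhz / f_min_mhz) / math.log1p(delta)

# Frequency ranges and chemical shift spans as quoted in the text.
proton = max_samples(215.0, 516.0, 10.0)    # H1, 5 to 12 Tesla
carbon = max_samples(21.4, 85.6, 100.0)     # C13, 2 to 8 Tesla

print(f"H1:  about {proton:.1e} samples")
print(f"C13: about {carbon:.1e} samples")
```

Under this packing assumption the estimates fall in the same range as the figures of 8×10⁴ and 1.5×10⁴ given above; a different packing rule or guard band between samples would change the numbers somewhat.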

An example of the NMR method of the present invention follows:

A family of polynucleotides is separated by polyacrylamide gel electrophoresis, where the nucleotides and terminal dideoxynucleotide of each polynucleotide are NMR labeled. Separated fractions can be collected from the gel by the instrumentation described in the Electrophoresis Instrumentation Section. In the case that C¹³ nucleotides are used, the fractions may be collected from the buffer well or on glass beads (see Electrophoresis Instrumentation). When H¹-deuterium labeled nucleotides are used, the preferred method is to collect the DNA on glass beads. This procedure is a modification of Method II described by Vogelstein et al., PNAS 76, 615 (1979). Glass beads are present in the lower anode well, and the electrophoresis buffer in this well contains saturated NaI as described in the above reference. Following collection, the glass beads are washed with the NMR buffer containing saturated NaI. This step is implemented to remove hydrogen atoms from the sample that would produce an NMR signal in the same region as the signal of the nucleotides. The DNA is eluted from the beads by suspending them in NMR buffer in the absence of NaI.

The isolated polynucleotides are scanned with an NMR spectrometer.

H¹ spectra: If deuterium-H¹ labeled nucleotides are used, then samples should be made approximately 1 mM in concentration. The urea and electrophoresis buffer can be removed by ion exchange chromatography. Also, if the glass bead technique is employed as described previously and in the Electrophoresis Instrumentation Section, and the electrophoresis buffer contains urea, then the urea and electrophoresis buffer can be exchanged for phosphate buffer and D₂O by first rinsing the beads to which the nucleic acid is adsorbed with this buffer saturated with NaI and then eluting the polynucleotides from the beads by dissolving them in this NMR buffer in the absence of NaI. The proton spectrum may be obtained at 41° C., 1 mM strand concentration and 220 MHz. If base stacking interactions significantly distort the proton NMR spectra, then the bases can be removed from the ribose or deoxyribose backbone by adding an acid anhydride, which forms an acid in D₂O that will depurinate and depyrimidate the polynucleotides:

    SO₂ + D₂O → D₂SO₃

    SO₃ + D₂O → D₂SO₄

    CO₂ + D₂O → D₂CO₃

    N₂O₃ + D₂O → 2 DNO₂

    N₂O₅ + D₂O → 2 DNO₃

    Cl₂O₇ + D₂O → 2 DClO₄

    P₂O₃ + 3 D₂O → 2 D₃PO₃

    P₂O₅ + 3 D₂O → 2 D₃PO₄

NaOD can be added to neutralize the solution. When this procedure is used, only a fraction of any polynucleotide sample is used and the remainder is reserved for subsequent steps.

C¹³ NMR spectra: C¹³ spectra may be gathered at 15.1 MHz using 25 mM aqueous solutions and 1 percent (v/v) p-dioxane as the internal standard. Spectra can also be obtained using a field gradient as described previously. Proton-containing reagents need not be removed, and protons may be decoupled by noise modulation at 600 MHz. C¹³ NMR spectral methods are discussed by Dorman et al., PNAS 65:19-26 (1970). If base stacking interactions distort the spectra, then part of the sample can be depurinated and depyrimidated with strong acid and neutralized with base.

DETERMINATION OF THE BASE COMPOSITION AND TERMINAL BASE IDENTITY OF AN OLIGO- OR POLYNUCLEOTIDE BY THIN LAYER CHROMATOGRAPHY AND LABELING MASS SPECTROMETRY

The preferred methods for determining the base composition and terminal base identity of an oligo- or polynucleotide containing a dideoxyterminal nucleotide are thin layer chromatography and labeling mass spectrometry. In one embodiment, nucleotides, nucleosides, free bases or reaction products of bases are liberated from the oligo- or polynucleotide and are separated by thin layer chromatography. The intensity of the distinct chromatographic bands corresponding to the different bases is quantified.

In another embodiment, the fifth position of the pentose of each nucleotide of the oligo- or polynucleotide is mass labeled in such a fashion that each possible nucleotide and terminal nucleotide releases a distinct mass labeled molecule, such as formaldehyde, from the fifth position of the pentose when the oligo- or polynucleotide is degraded to 3' nucleotides or nucleosides and reacted with periodic acid. The intensity of the liberated molecules of different masses corresponding to different bases is recorded using a mass spectrometer.

In both embodiments, the intensities of the signals which correspond to the different bases are normalized with the signal corresponding to the terminal base, which serves as the internal standard with a signal of unity. The base composition and terminal base identity are given by the normalized data.

BACKGROUND

Deoxyribonucleic acid (DNA) and ribonucleic acid (RNA) are chainlike macromolecules that function in the storage and transfer of genetic information. The monomeric units of DNA are called deoxyribonucleotides; those of RNA are ribonucleotides. Each nucleotide contains three characteristic components: (1) a nitrogenous heterocyclic base, which is a derivative of either pyrimidine or purine; (2) a pentose; and (3) a molecule of phosphoric acid.

Four different deoxyribonucleotides serve as the major components of DNAs; they differ from each other only in their nitrogenous base components, after which they are named. The four bases characteristic of the deoxyribonucleotide units of DNA are the purine derivatives adenine and guanine and the pyrimidine derivatives cytosine and thymine. Similarly, four different ribonucleotides are the major components of RNAs; they contain the purine bases adenine and guanine and the pyrimidine bases cytosine and uracil.

Deoxyribonucleotides contain as their pentose component 2-deoxy-D-ribose; ribonucleotides contain D-ribose. Both pentoses occur in their furanose forms in nucleotides. The pentose is joined to the base by a β-N glycosyl bond between carbon atom 1 of the pentose and nitrogen atom 9 of purine bases or nitrogen atom 1 of pyrimidine bases. The phosphate group of nucleotides is in ester linkage with carbon atom 5 of the pentose to form 5' nucleotides or in ester linkage with carbon atom 3 to form 3' nucleotides. When the phosphate group of a nucleotide is removed by hydrolysis, the structure remaining is called a nucleoside.

DNA and RNA are macromolecules which are polymers of nucleotides joined by phosphodiester bonds between the 3' hydroxyl group of one nucleotide and the 5' phosphate of another nucleotide. These polymers are formed by enzymes which polymerize an antisense copy of an existing polymer by using it as a template, in a process called replication, where a purine is incorporated into the growing copy by being paired with a pyrimidine of the template, and where a pyrimidine is incorporated into the growing copy by being paired with a purine of the template. Polymerization is terminated when a nucleotide lacking a 3' OH is incorporated into the growing chain, because the next nucleotide cannot be linked to form a phosphodiester bond. Nucleotides which lack a 3' OH, have the same base characteristics as deoxyribonucleotides, and contain 2,3-dideoxy-D-ribose as the pentose are called dideoxynucleotides. They are typically used to randomly terminate replication reactions of a DNA or RNA strand. The resulting products are oligo- or polynucleotides which are terminated by a dideoxynucleotide. It is important to determine the base composition and terminal base identity of these oligo- or polynucleotides in order to sequence the strand from which they were replicated.

SUMMARY OF THE PREFERRED MODES

The component bases, nucleosides, and nucleotides of an oligo- or polynucleotide can easily be separated by electrophoretic or chromatographic means. This property is exploited by the present invention to determine the base composition and terminal base identity of an oligo- or polynucleotide by means of thin layer chromatography (TLC).

The oligo- or polynucleotide is degraded to component bases, reaction products of bases, nucleosides or nucleotides. An immobile phase and solvent are chosen which will resolve each degraded component as a unique chromatographic band. The mixture of components is applied to the TLC plate and elution is carried out until the components are resolved. The identities of the bands are determined from known Rf values. The intensity of each band is determined by fluorescence, absorption, reflectance, or radiodecay quantification, and the intensity of the band corresponding to the terminal base is used as the internal standard to normalize the signals of the bands corresponding to the other component bases, which gives the base composition directly.

An alternative to exploiting the different physical properties of bases, reaction products of bases, nucleosides and nucleotides for their separation and quantification is to initially label the pentose of all component nucleotides which are used in a replication reaction of a strand of unknown sequence to produce the desired oligo- or polynucleotide, for which the base composition and the terminal base identity information is then determined via mass spectroscopy. Quantification of each component base can be obtained by analyzing the reaction products derived from the nucleotides of the oligo- or polynucleotide to record the intensity of the mass signal from each uniquely labeled reaction product, where the correspondence of the mass signal of each labeled reaction product to the particular base is known. Normalization of each signal by that corresponding to the dideoxyterminal gives the composition directly.

A method of the present invention to determine the base composition and terminal base identity is to uniquely mass label the fifth position of the pentose of each nucleotide. These nucleotides are used as substrates in a replication reaction to synthesize the oligo- or polynucleotide. Following its synthesis and isolation, the oligo- or polynucleotide is degraded to 3' nucleotides or nucleosides, and the β-N glycosyl bond between the base and the pentose of the nucleotides or nucleosides is chemically broken to produce diols of carbons 4 and 5 of the pentose. The reaction mixture is treated with periodic acid to release labeled formaldehyde from only the carbon 5 position of the pentose. The intensity of each distinct mass labeled formaldehyde molecule is recorded with a mass spectrometer, and the signals are normalized with the signal corresponding to the dideoxyterminal base. From the known correspondence between the masses of the labels and the bases, the composition and terminal base identity are given directly by the normalized signals.

DETAILED DESCRIPTION OF THE PREFERRED MODES

Determination of the Base Composition and Terminal Base Identity by Thin Layer Chromatography

To determine the base composition and terminal base identity by thin layer chromatography, the oligo- or polynucleotide is subjected to a digest or degradation reaction to yield predictable component species such as component bases, reaction products of bases, nucleosides, or nucleotides. The reaction mixture is applied to a TLC plate and is eluted with a suitable solvent until each of the component bases, reaction products of bases, nucleosides, and/or nucleotides is resolved from the others. The intensity of each chromatographic band is quantified by absorbance, reflectance, fluorescence, or radioactivity. All bands are identified from known Rf values, and the signals from all bands are normalized by the signal of the band which contains the component released from the dideoxyterminal of the oligo- or polynucleotide. The Rf values and the normalization process give the base composition and terminal base identity.

The digest of the oligo- or polynucleotide can be performed enzymatically with an exonuclease to produce 3' or 5' nucleotides, which can be further enzymatically degraded to nucleosides with alkaline phosphatase.

Also, hydrolysis of the β-N glycosyl bond between carbon atom 1 of the pentose and nitrogen 9 of purine bases or nitrogen 1 of pyrimidine bases can be achieved by heating the oligo- or polynucleotide in strong acid to release free bases. Reactions also exist which will selectively break either the purine or the pyrimidine β-N glycosyl bonds in the presence of the other. Thus, the pyrimidine and purine composition can be determined, and the terminal base can be identified, from the combined results of two separate reactions, where equal volume aliquots are used to run separate depurination and depyrimidation reactions followed by separation of the products of each reaction independently by TLC.

The chromatographic bands are quantified and then normalized with the signal of the band corresponding to the terminal base, the internal standard. For the case in which free bases or reaction products of free bases are used as the components released from the oligo- or polynucleotide and separated by TLC, it is necessary that each base or the reaction product of each base which is released from each of the four possible terminal nucleotides has a unique mobility so that it can be separated from the other bases and/or reaction products of bases. Thus, the terminal dideoxynucleotide must possess a base analogue which is incorporated into DNA by the polymerase as if it were the base for which it substitutes. Examples include 5-bromouracil and 5-iodouracil as analogues of thymine; tubercidin, toyocamycin, formycin, 7-deazanebularin, 2-aminopurine, sangivamycin, 2-aminoadenine, 2-fluoroadenine, and 8-azaadenine as analogues of adenine; 5-methylcytosine, 5-hydroxymethylcytosine, 2-thiocytosine, N4-methylcytosine, N4-acetylcytosine, isocytosine, and 5-azacytosine as analogues of cytosine; and 7-deazaguanine, crotonoside, 1-methylguanine, N2-methylguanine, N2,N2-dimethylguanine, 7-methylguanine, 6-selenoguanine, and 6-thioguanine as analogues of guanine.

Selective chemistry which exclusively releases purine or pyrimidine bases from the oligo- or polynucleotide is that of the Maxam-Gilbert sequencing reactions. The β-N glycosyl purine bond and pyrimidine bond can be selectively broken by running to completion the G-A formic acid reaction and the T-C hydrazine reaction of the Maxam-Gilbert sequencing chemistry, respectively.

An example of the TLC method of the present invention, where an exonuclease is used to degrade the oligo- or polynucleotide to component 5' nucleotides which are separated by TLC and quantified, follows.

The oligo- or polynucleotide, which contains P32 labeled nucleotides with a known activity of approximately 10⁶ dpm/µg, is hydrolyzed to 5' nucleotides by incubating with exonuclease as described by Rabin, E. Z., et al., Biochimica et Biophysica Acta, 259, (1972), pp. 50-68.

The nucleotides are separated by TLC and the radioactivity of each band is quantified. The identity of each nucleotide on the TLC plate is determined from its known Rf value. The composition is obtained by normalizing the signal from each nucleotide with the signal from the terminal nucleotide, which serves as an internal standard of one nucleotide.
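The normalization step can be illustrated with a minimal Python sketch. The band names and counts below are hypothetical values chosen for illustration, not measured data, and the sketch assumes that every nucleotide carries the same specific activity so that counts are proportional to the number of residues. The function name `base_composition` is likewise an illustrative choice.

```python
def base_composition(band_counts, terminal_band):
    """Normalize band intensities by the terminal-nucleotide band.

    The terminal band is the internal standard and is assigned a value of
    one nucleotide; every other band is expressed as a multiple of it and
    rounded to the nearest whole number of residues.
    """
    standard = band_counts[terminal_band]
    if standard <= 0:
        raise ValueError("terminal band must have a positive signal")
    return {band: round(counts / standard)
            for band, counts in band_counts.items()}

# Hypothetical scintillation counts (dpm) for the resolved TLC bands.
counts = {
    "dAMP": 3050.0,
    "dGMP": 1980.0,
    "dCMP": 4020.0,
    "dTMP": 2990.0,
    "ddAMP": 1000.0,   # band released from the dideoxy terminus
}

composition = base_composition(counts, terminal_band="ddAMP")
print(composition)
```

With these illustrative counts the fragment would be read as containing three A, two G, four C and three T residues plus a terminal A. The same normalization applies unchanged to the mass spectrometric signals described later.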

TLC has many advantages which include sharpness of resolution, picomolar sensitivity, simplicity, speed, and low cost. In general, the method is as follows: the sample is applied from a micropipette 2 cm from the lower edge of a TLC plate as a volume of 0.05-1 µl. The sample can be applied directly without extraction from proteins or change of solvent. (See Methods of Enzymology, XXIX, pp. 618-19.) If the volume is greater than 1 µl, the larger volume is spotted in portions with intermediate drying. The TLC plate is placed in a developing tank and elution is carried out until the solvent is about 2 cm from the end of the plate. After drying, the nucleotide spots can be located under ultraviolet light or by autoradiography. The nucleotide-containing areas can be cut from the support and assayed for radioactivity by putting each spot in a liquid scintillation vial, adding scintillation cocktail and counting in a liquid scintillation counter. Also, quantification can be obtained directly by densitometry as discussed in Haer, F. C., An Introduction to Chromatography on Impregnated Glass Fiber, Ann Arbor Science Publishers Inc., Ann Arbor, Mich., 1969, pp. 55-57.

Pertinent references for the Rf values of free bases, nucleosides, nucleotides, and methods of TLC are as follows: CRC Handbook Series in Chromatography, Vol I, 1972, pp. 618-23. Randerath, K., Angew. Chem., 73, 674 (1961) from J. Chromatogr., 9, (1962) p. 35. Coffey, R. G. and R. W. Nenburgh, J. Chromatogr., 11, 376 (1963). Shasha, B. and R. L. Whistler, J. Chromatogr., 14, 532 (1964). Haer, F. C., An Introduction to Chromatography on Impregnated Glass Fiber, Ann Arbor Publishers, Inc., Ann Arbor, Mich., 1969, pp. 1-63 and 143-51. Strickland, R. G., Anal. Biochem., 10, 116 (1965). Grippo, B., Iaccarino, M., Rossi, M., and E. Scarono, Biochim. Biophys. Acta, 95, 1 (1965). Wang, K. T., and I. S. Y. Wang, Biochim. Biophys. Acta, 142, 280 (1967). Dietrich, C. P., Dietrich, S. M. C., and H. G. Points, J. Chromatogr., 15, 277 (1964). Randerath, K., Angew. Chem. Intern. Ed. Engl., 1, 435 (1962). Randerath, K., and Struck, J., J. Chromatogr., 6, 365 (1961). Randerath, K., Biochem. Biophys. Res. Commun., 6, 452 (1961/62). Methods of Enzymology LIX, pp. 62-80. Methods of Enzymology XXIX, pp. 618-19 and 285-91. Methods of Enzymology XII, pp. 323-344. Methods of Enzymology XX, pp. 119-125.

Determination of the Base Composition and Terminal Base Identity by Mass Spectrometry

In addition to TLC, the base composition and terminal base identity can also be determined by mass spectrometry.

Because DNA is composed of four different bases, eight unique signals are necessary to identify each base and each possible terminal base of an oligo- or polynucleotide by mass spectroscopy. The mass spectrum of intact polynucleotides cannot be obtained directly because these molecules are nonvolatile. The mass spectrum of each nucleotide and nucleoside is extremely complex. Each spectrum consists of peaks as a function of mass whose relative heights correspond to the relative yield of charged fragments of the masses at which the peaks are found, where the charged fragments are generated from the nucleotide or nucleoside as it is broken up by an ionizing electron beam. The relative heights of the peaks are thus a function of the production efficiency of the charged fragments of different masses. The complexity of the spectrum increases for a mixture of nucleotides and/or nucleosides, and the relative intensity of the peaks of a spectrum of a mixture is dependent on the relative production efficiency of the nucleotide and/or nucleoside fragments of different masses which correspond to the peaks of the spectrum as well as on the relative concentration of the nucleotides and/or nucleosides of the mixture.

Molecules with less complex mass spectra can be produced chemically from nucleotides or nucleosides. However, if different reactions are performed on different nucleotides or nucleosides, or if the reaction product which is mass spectroscopically analyzed is not the same for each base, then the determination of the base composition and terminal base identity becomes complicated. In the former case, the intensity of the peak characteristic of each nucleotide or nucleoside and corresponding to each base is dependent on the extent of the reaction which generates the molecules whose mass spectrum possesses the given peak. In the latter case, the intensity of the peak characteristic of each nucleotide or nucleoside and corresponding to each base is dependent on the production efficiency of the fragment of the peak from the given reaction product. For each base and possible terminal base, the production efficiency and extent of reaction must be calibrated and the intensity signal must be corrected. This necessity is averted by the method of the present invention, which entails labeling the fifth position of the pentose of each nucleotide of an oligo- or polynucleotide and causing the nucleotides or nucleosides released from this macromolecule to undergo a common reaction to release the same volatile molecule, which is analyzed by mass spectroscopy. Because a common reaction is performed on all component pentoses, the dependence of the relative intensities of the signals corresponding to different bases on the extent of reaction is eliminated. And, because the reaction product which is analyzed by mass spectroscopy is the same for each base, the relative intensity of the signal corresponding to each base is dependent only on the relative concentration of each mass labeled reaction product and not on the ion production efficiency.

The mass labeled molecule which is sensed for all eight possible signals indicating base composition and terminal base identity of any given fragment is formaldehyde. Calibration of ion production efficiency of different sensed species is eliminated where all mass labeled formaldehyde molecules have the same production efficiency. The atoms used for labeling are oxygen, carbon, and hydrogen at the fifth carbon position of the nucleotide pentose. The atoms and isotopes involved in labeling are given below with their natural abundances and half lives.

    __________________________________________________
    Isotope    % Natural Abundance    Half-life
    __________________________________________________
    H1         99.985                 -
    H2         0.015                  -
    H3         0                      12.26 y
    C12        98.89                  -
    C13        1.11                   -
    C14        0                      5730 y
    O16        99.759                 -
    O17        0.037                  -
    O18        0.204                  -
    __________________________________________________

Carbon, hydrogen, and oxygen isotopes are used such that each successive formaldehyde molecule differs by one atomic mass unit. A method of synthesizing sugar molecules labeled in the end position is described by Sowden, John C., J. Am. Chem. Soc. 74, (1952) pp. 4377-4379. The mass labeled pentoses are synthesized using this procedure where the appropriate isotopes of carbon, oxygen, and hydrogen are present in the reagents. Possible distinct mass labeled formaldehyde molecules are given as follows: ##STR12##
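The range of nominal masses available from the isotopes tabulated above can be enumerated with a short Python sketch. It does not reproduce the particular numbered molecules of the figure; it only lists, under the assumption that any tabulated isotope may occupy the carbon, oxygen, and either hydrogen position of formaldehyde, which nominal masses can be reached. The function name `formaldehyde_masses` is hypothetical.

```python
from itertools import combinations_with_replacement, product

# Nominal masses of the isotopes listed in the table above.
CARBON = {"C12": 12, "C13": 13, "C14": 14}
HYDROGEN = {"H1": 1, "H2": 2, "H3": 3}
OXYGEN = {"O16": 16, "O17": 17, "O18": 18}

def formaldehyde_masses():
    """Map each nominal mass to the isotope combinations of CH2O giving it."""
    by_mass = {}
    # The two hydrogens are equivalent, so unordered pairs are enumerated.
    hydrogen_pairs = list(combinations_with_replacement(HYDROGEN, 2))
    for c, (h_a, h_b), o in product(CARBON, hydrogen_pairs, OXYGEN):
        mass = CARBON[c] + HYDROGEN[h_a] + HYDROGEN[h_b] + OXYGEN[o]
        by_mass.setdefault(mass, []).append((c, h_a, h_b, o))
    return by_mass

masses = formaldehyde_masses()
for mass in sorted(masses):
    print(mass, "daltons:", len(masses[mass]), "combination(s)")
```

The enumeration spans nominal masses 30 through 38 in steps of one atomic mass unit, which agrees with the 30 to 38 dalton range recorded in Scheme 2. Which combination is chosen for each mass is a matter of synthetic convenience and is not addressed by this sketch.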

The oligo- or polynucleotide for which the composition and terminal base data are determined is hydrolyzed to 3' nucleotides or nucleosides. For all 3' nucleotides, including ribonucleotides, deoxyribonucleotides, and dideoxyribonucleotides, and for nucleosides, hydrolysis of the β-N glycosyl bond between carbon atom 1 of the pentose and nitrogen 9 of purine bases or nitrogen 1 of pyrimidine bases produces a terminal diol involving only pentose positions 4 and 5. When oxidized with periodic acid, the 4-5 diol pentose releases a formaldehyde molecule only from the fifth carbon position, which contains carbon atom 5 and the two hydrogen atoms and oxygen atom bound to carbon atom 5. Reactions exist which will selectively break either the purine or the pyrimidine β-N glycosyl bonds in the presence of the other. Because the glycosyl bond must be broken to produce a terminal diol in order for formaldehyde to be released from the fifth carbon position, this glycosyl bond breaking chemistry can be used to separately analyze exclusively the purine or pyrimidine components of an oligo- or polynucleotide containing both types of bases by the labeling mass spectroscopic method. Examples of methods of determining the purine and pyrimidine components in two separate reactions and as one reaction follow.

Scheme 1: Release of Formaldehyde from Purines and Pyrimidines as Separate Reactions.

For example, exemplary formaldehyde molecules one, two, four, and five are used to identify the number of adenine and guanine bases and the presence of terminal adenine or terminal guanine in any given oligo- or polynucleotide in one reading, and to identify the number of thymine and cytosine bases and the presence of terminal thymine or terminal cytosine in the same oligo- or polynucleotide in a second reading, where the corresponding isotopes of carbon, oxygen, and hydrogen are used to label the fifth position of the nucleotide pentoses corresponding to the different bases. The two independent readings are performed on samples from two separate reactions, where each reaction mixture contains an equal volume aliquot of a hydrolyzed oligo- or polynucleotide. One reaction liberates labeled formaldehyde from purine nucleotide pentoses, and the other liberates labeled formaldehyde from pyrimidine nucleotide pentoses. It is implicit that the signal given by the terminal nucleotide serves as the internal standard for the signals of the bases in both reactions. The reaction sequence is as follows:

The oligo- or polynucleotide is isolated and hydrolyzed to 3' nucleotides or nucleosides. For example, the former reaction can be accomplished with a nuclease such as exonuclease from B. subtilis. Two equal volume aliquots are taken of the hydrolyzed reaction mixture. One aliquot is treated to break purine β-N glycosyl bonds, and the other is treated to break pyrimidine β-N glycosyl bonds, where methodology such as that of the Maxam-Gilbert sequencing chemistry is used. For example, formic acid is used in the former reaction, and hydrazine is used in the latter, where the reactions are run to completion. In both cases, breaking the β-N glycosyl bond creates a diol involving carbon positions 4 and 5 of the pentoses. A molecule containing such a grouping is oxidized and cleaved by periodic acid (HIO₄). Formaldehyde is released from the fifth carbon only when the glycosyl bond has been broken. The sample is heated to liberate formaldehyde gas, and only the signals of molecules with mass between 30 and 34 daltons are recorded by mass spectroscopy. Formaldehyde is not released from any other pentose position. Assuming the reactions go to completion or to the same extent for all pentoses, noise is limited to the purity of the labels. If the 14C labeled nucleotides are used within 83 years and the 3H labeled nucleotides are used within 64 days, the composition and terminal base identity are determined accurately for oligo- or polynucleotides of approximately 400 nucleotides in length by this spectroscopic method. Mass spectroscopy offers the advantages of a faster processing time and a greater sensitivity than TLC methods. The processing time and sensitivity are a fraction of a second and femtomoles for the former, and minutes and picomoles for the latter. And, the formaldehyde molecules differ by at least one atomic mass unit and cover a narrow mass range; thus, an unsophisticated mass spectrometer is adequate. Thus, this technology represents a valuable alternative to TLC methods.
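The 83 year and 64 day limits can be related to the half lives tabulated earlier by a simple decay calculation. The Python sketch below computes the fraction of label that has decayed after each interval; reading the two limits as the time for roughly 1% of the label to decay is an inference from the numbers and is not stated in the text. The function name `fraction_decayed` is illustrative.

```python
import math

def fraction_decayed(elapsed_years, half_life_years):
    """Fraction of a radioactive label lost after elapsed_years."""
    return 1.0 - math.exp(-math.log(2.0) * elapsed_years / half_life_years)

# Half lives as given in the isotope table: C14 5730 y, H3 12.26 y.
c14_loss = fraction_decayed(83.0, 5730.0)
h3_loss = fraction_decayed(64.0 / 365.25, 12.26)

print(f"14C after 83 years: {100 * c14_loss:.2f}% decayed")
print(f"3H after 64 days:   {100 * h3_loss:.2f}% decayed")
```

Both intervals correspond to a loss close to 1% of the label, which is consistent with treating label purity as the limiting source of noise for fragments of a few hundred nucleotides.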

The mass spectrum of formaldehyde consists of three peaks which are each separated by one atomic mass unit. Formaldehyde molecules of different masses can contribute intensity to the same peak; thus, the signal from a mixture of mass labeled formaldehyde molecules must be processed. Signal processing involves subtracting the contribution that one of the lower mass peaks of a heavier formaldehyde molecule makes to the peak of highest mass of a lighter formaldehyde molecule. The initial corrections can be derived by multiplying the spectral peak of highest mass separately by the two calibrated numbers which give the intensities of the two additional peaks of the formaldehyde molecule corresponding to the spectral peak of highest mass. These contributions to the peaks of one and two atomic mass units less than the most massive peak are subtracted from the respective spectral peaks. The process is then repeated using the now corrected peak of one atomic mass unit less than the spectral peak of highest mass. And, the process is reiterated until the corrected heaviest peak of each formaldehyde molecule is obtained. These processed signals are used to determine the base composition.
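The iterative subtraction just described can be sketched in Python as follows. This is a simplified illustration, not a calibrated procedure: it assumes that every labeled formaldehyde molecule produces its two additional peaks at exactly one and two mass units below its heaviest peak and with the same two calibrated ratios, whereas a real instrument would need the ratios calibrated for each labeled molecule. The function name `correct_peaks`, the ratios, and the intensities are hypothetical.

```python
def correct_peaks(observed, ratio_minus1, ratio_minus2):
    """Strip fragment contributions from a mixed formaldehyde spectrum.

    observed      -- dict mapping nominal mass to measured peak intensity
    ratio_minus1  -- calibrated intensity of the (M-1) peak relative to M
    ratio_minus2  -- calibrated intensity of the (M-2) peak relative to M

    Working from the heaviest peak downward, the contribution each
    corrected peak makes to the peaks one and two mass units below it
    is subtracted, leaving the heaviest peak of each molecule.
    """
    corrected = dict(observed)
    for mass in sorted(observed, reverse=True):
        parent = corrected[mass]
        if mass - 1 in corrected:
            corrected[mass - 1] -= parent * ratio_minus1
        if mass - 2 in corrected:
            corrected[mass - 2] -= parent * ratio_minus2
    return corrected

# Hypothetical spectrum over the 30 to 34 dalton window of Scheme 1.
observed = {34: 1.0, 33: 4.6, 32: 2.7, 31: 3.2, 30: 4.2}
corrected = correct_peaks(observed, ratio_minus1=0.6, ratio_minus2=0.3)

for mass in sorted(corrected, reverse=True):
    print(mass, round(corrected[mass], 6))
```

In this example the peak at mass 32 is reduced to zero by the correction, showing that it arose entirely from fragments of the heavier molecules, while the remaining corrected peaks are in the ratio 1 : 4 : 2 : 3 for masses 34, 33, 31 and 30. Normalizing by the peak assigned to the terminal base then gives the composition.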

Implicit in the case where a label of mass 32 daltons is used is that O₂ is removed by a scavenging system such as MnO. Alternatively, GC-mass spectrometry is used, or formaldehyde is preoxidized to CO₂ before recording. Carbon dioxide gives a distinct line signal in mass spectrometry which needs no further processing of the kind necessary for formaldehyde; however, recording CO₂ precludes the scheme of additionally labeling the hydrogens of formaldehyde described below. Ultimately, the cost effectiveness of additional labeling and of removing O₂, together with accuracy, will determine which scheme to implement for the specific application. FIG. 7B is an example of mass spectroscopic data from fragment #15 of FIG. 4, where the data was obtained from labeled CO₂.

Scheme 2: Release of Formaldehyde from Purines and Pyrimidines as a Single Reaction.

For example, exemplary formaldehyde molecules one, two, and four through nine are used to identify the number of adenine, guanine, cytosine, and thymine bases and the presence of terminal adenine, thymine, guanine or cytosine, where the corresponding isotopes of carbon, hydrogen, and oxygen are used to label the fifth position of the nucleotide pentoses corresponding to the different bases. The oligo- or polynucleotide to be analyzed is hydrolyzed to 3' nucleotides or nucleosides as for Scheme 1. However, the β-N glycosyl bond for all nucleotides is broken in one reaction mixture. This can be achieved, for example, by heating in acid. The periodic acid oxidation is run, and formaldehyde of mass 30-38 daltons is released as a gas and quantified with a mass spectrometer. Implicit is that O₂ is removed by a scavenging system or by gas chromatography.

ELECTROPHORESIS INSTRUMENTATION SECTION

For each reaction mixture, load the polynucleotides which differ by one base pair on to X tube gels (where X equals the number of reaction mixtures). The tube gels are arranged in a row and are spaced so that the bottom of each is located on a separate disk of a carousel collector that contains wells of activated glass beads which adsorb the nucleic acid. The carousel collection apparatus comprises concentric disks containing wells of glass beads. The disks rotate independently of each other as electrophoresed bands are collected individually from the tube gel. The apparatus is shown in FIG. 10D. Thus, there are X concentric disks, each having Y wells, where Y is the number of samples to be collected from a single tube gel. The tube gels exist as part of an electrophoresis apparatus shown in FIG. 10A. A supersaturated NaI buffer which enhances the binding of the nucleic acid to the beads is used during electrophoresis. The beads are part of the matrix through which the DNA migrates in the presence of the applied electric field, and the solvent completes the circuit because it readily permeates through the glass beads. See FIG. 10B. Absorption of UV light by nucleic acid is monitored at each tube by a photocell (other methods of monitoring bands are mass spectroscopy; scintillation detection of radiolabeled nucleotides, nucleosides, or bases; ethidium bromide fluorescence detection; and conductance perturbation detection) and the disk is rotated to move the next well into place following each peak.

For the case where NMR labeled nucleotides are used, after the separation procedure is finished, the solvent is removed; the glass beads are mildly washed with D₂O, and then the disks are rotated so that the wells successively align with a hole through which the samples of glass beads fall into receiving vessels as depicted in FIG. 10C. The nucleic acid is removed from the glass beads by vigorous solvation in D₂O. The polynucleotides are hydrolyzed in acid anhydride and an NMR scan is performed. For the case where mass labeled nucleotides are used, the polynucleotides adherent to the glass beads can be reacted to release formaldehyde directly or the polynucleotides can be liberated from the glass beads by washing the beads with or suspending the beads in a solvent which does not contain NaI.

Such a recovery of the polynucleotides is necessary for any subsequent sequencing reactions and/or enzymatic digests to release nucleotides or nucleosides which are analyzed by chromatography. Free bases of the polynucleotide are released from the glass beads by heating them in strong acid, where analysis of the liberated bases is by chromatography.

DNA can also be collected as fractions by removing DNA from a small anode well (via outlet) which is refilled with electrophoresis buffer (via inlet) after each collection. The electrophoresis apparatus is as described previously and also appears in FIG. 10A. One unit of the collection apparatus is demonstrated in FIG. 10E, where the collection apparatus consists of X units, one for each of the tube gels. Each band is collected individually; thus, Y collections are performed during the electrophoresis procedure.

AUTOMATING THE METHOD

The general system design of an automated apparatus for the labeling strategy of scheme 2 is as follows:

a) a first reaction vessel comprising a reaction chamber having means for transfer of reagents, reactants and reaction products into and out of the chamber;

b) means for separating individual oligonucleotides and polynucleotides on the basis of length, the separating means having means for receiving reaction products from the first reaction vessel;

c) a second reaction vessel for oxidizing pentose sugars comprising a second reaction chamber having means for transfer of reactants and reagents into the chamber and gaseous by-products of a reaction out of the chamber;

d) means for selectably transferring separated oligonucleotides and polynucleotides from the separating means alternatively into the first or the second reaction vessel;

e) a second transfer means for transferring the gaseous by-products out of the second reaction vessel;

f) a chamber for collecting the gaseous by-products, the chamber being in communication with the second transfer means;

g) third transfer means for transferring gaseous by-products from the collection chamber to the analyzing means; and

h) means for analyzing the relative abundance of the components of the gaseous by-product by mass, the analyzing means being in communication with the third transfer means.

The preferred system design of the apparatus to automate the preferred method shown schematically in FIG. 2, which uses the labeling strategy of scheme 2, is shown in FIG. 8. The automation procedure implementing the apparatus of FIG. 8 is as follows. The isolated polynucleotide fragment is transcribed into RNA using, for example, a first reaction vessel comprising a reaction chamber and a computer controlled robotic dispenser comprising means to transfer reagents into and out of the chamber to carry out the reaction. This programmable instrument, which performs this and other reactions, does so under controlled conditions by dispensing stock solutions of precise volume and by controlling variables such as the temperature and time of reaction. This instrument automatically performs the tasks of pipetting reaction products to or from a reaction mixture.

The product RNA copies are separated electrophoretically using the electrophoretic unit. The electrophoretic bands are electroeluted from the gel and pumped using a fluid pump through the open one way valve B of a selectable two way valve comprising one way valves A and B into a continuous flow periodic acid oxidation cell. The polynucleotide is hydrolyzed to release pentoses which are oxidized to release formaldehyde. The liberated formaldehyde is collected in the formaldehyde cell and injected into the mass spectrometer. The mass spectroscopic signal is used to monitor the electrophoretic bands. A pure electrophoretic band will produce a mass spectroscopic signal containing peaks from the individual bases at integer ratios, and the absolute intensity as a function of time of each peak of the signal will follow a Gaussian curve. In response to the signal of a pure band, the computer executes the closing of one way valve B and the opening of one way valve A. The peak is collected in the reaction vessel.

The isolated RNA fragment is extended in DNA and terminated randomly by performing the replication reaction in the presence of all four possible dideoxynucleotides in a single reaction using the robotic dispenser to perform the reaction. The reaction products are separated on the electrophoretic unit. The electrophoretic bands are electroeluted from the gel and pumped through the open one way valve B and into the continuous flow periodic acid oxidation cell. The liberated formaldehyde is collected in the formaldehyde cell and injected into the mass spectrometer which serves the dual purpose of monitoring the electrophoretic bands and determining the base composition and terminal base identity of the polynucleotide of each band. In response to the signal from each pure band containing a polynucleotide to be 5' randomized, the computer executes the closing of one way valve B and the opening of one way valve A. A portion of the band is collected in the reaction vessel.

The isolated hybrid molecule is 5' randomized in a reaction using the robotic dispenser to perform the reaction. The 5' randomized reaction products are separated electrophoretically using the electrophoretic unit. The electrophoretic bands are pumped through the open one way valve B into the continuous flow periodic acid oxidation cell. The liberated formaldehyde is collected in the formaldehyde cell and injected into the mass spectrometer where the signals give the base compositions and terminal base of each 5' randomized polynucleotide.
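
The integer-ratio criterion for a pure band, and the valve decision that follows from it, can be sketched as below. This is a minimal sketch under stated assumptions: the label names, the tolerance value, and the function names are hypothetical, the terminal-base peak is assumed to represent exactly one nucleotide, and the Gaussian time profile of each peak is not checked here.

```python
def composition_if_pure(peaks, terminal_label, tolerance=0.1):
    """Return integer base counts if the band looks pure, else None.

    peaks          -- dict mapping a label (hypothetical names) to the
                      processed mass-signal intensity for that label
    terminal_label -- label of the 3' terminal base, assumed to be
                      present exactly once per polynucleotide
    tolerance      -- allowed deviation of each ratio from an integer
                      (an illustrative value, not from the specification)
    """
    unit = peaks[terminal_label]
    if unit <= 0:
        return None
    counts = {}
    for label, intensity in peaks.items():
        ratio = intensity / unit
        nearest = round(ratio)
        if abs(ratio - nearest) > tolerance:
            return None  # non-integer ratio: band is a mixture
        counts[label] = nearest
    return counts


def valve_to_open(peaks, terminal_label):
    """Open valve A (collect) for a pure band, otherwise keep valve B open."""
    return "A" if composition_if_pure(peaks, terminal_label) else "B"


band = {"A": 3.02, "C": 1.96, "G": 4.0, "T": 0.03, "ddG": 1.0}
decision = valve_to_open(band, "ddG")
```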

All steps of this process, such as driving the robotic dispenser, the electrophoretic unit, the valve openings and closings, the pump activation and deactivation, and the formaldehyde sample injection into the mass spectrometer, are controlled by the computer. The computer monitors the automation of the preferred method using sensors such as temperature sensors for monitoring the chemical reactions and current and voltage sensors to monitor the electrophoretic unit. The computer stores the data of the mass spectroscopic signals in its memory and processes it to generate the base composition and terminal base identity data of each polynucleotide analyzed. The computer is programmed to identify pure electrophoretic bands and select the RNA band to be extended in DNA and the band or bands to be 5' randomized. The computer uses the data to solve the sequence of the fragment by implementing the matrix method of analysis algorithm.
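
As an illustration of how stored composition and terminal base data yield sequence, the following sketch handles only the simplest case: polynucleotides that share a 5' end and differ by one nucleotide at the 3' end. It is a deliberately reduced example of the compositional-difference step and is not the full matrix method of analysis, which also assigns bases to the 5' end. All names are hypothetical.

```python
from collections import Counter


def extend_3prime(compositions, terminals):
    """Read off the bases added at the 3' end of a nested set.

    compositions -- list of dicts (base -> count), ordered so that each
                    polynucleotide is one nucleotide longer than the last
    terminals    -- list of 3' terminal bases, one per polynucleotide
    Returns the bases added after the first (shortest) polynucleotide.
    Raises ValueError if composition and terminal base data disagree.
    """
    added = []
    pairs = zip(compositions, compositions[1:], terminals[1:])
    for shorter, longer, terminal in pairs:
        gained = Counter(longer) - Counter(shorter)  # keeps positive counts
        if sum(gained.values()) != 1 or gained[terminal] != 1:
            raise ValueError("composition change contradicts terminal base")
        added.append(terminal)
    return "".join(added)


# Nested set GA, GAT, GATT, GATTC described by composition and terminal base.
family = [
    {"G": 1, "A": 1},
    {"G": 1, "A": 1, "T": 1},
    {"G": 1, "A": 1, "T": 2},
    {"G": 1, "A": 1, "T": 2, "C": 1},
]
suffix = extend_3prime(family, ["A", "T", "T", "C"])
```

The consistency check mirrors the requirement that an assignment be made without contradiction between the compositional and terminal base data.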

In another embodiment the system design of the apparatus to automate the preferred method shown schematically in FIG. 2 which uses the labeling strategy of scheme 1 is as shown in FIG. 8 with the following exceptions: 1) there exists a means to transfer reaction products from the reaction vessel to the periodic acid oxidizing cell directly and 2) there exists a further oxidizing cell which receives formaldehyde gas and oxidizes it to carbon dioxide which is collected by a carbon dioxide cell (which replaces the formaldehyde cell shown in FIG. 8).

The automation procedure implementing this apparatus is the same as for the preferred mode described above except that each polynucleotide fraction which is analyzed to determine the base composition and terminal base identity is first collected in the reaction vessel comprising a reaction chamber and a robotic dispenser and hydrolyzed to 3' nucleotides or nucleosides. The 3' nucleotides or nucleosides are selectively depurinated or depyrimidated and then transferred to the periodic acid oxidation cell. The released formaldehyde gas is transferred to a further oxidizing cell where it is oxidized to carbon dioxide which is collected in a carbon dioxide cell and injected into the mass spectrometer. The fraction collections, reactions, transfers, oxidations, and mass spectroscopy are controlled by a programmable computer.

The automation apparatus can be expanded by the addition of reaction vessels (each comprising a reaction chamber and a robotic dispenser), electrophoretic units, oxidation cells, and gas chambers which interface a common mass spectrometer. The signal to noise ratio of a mass signal of a polynucleotide can be improved by acquiring it as an integration by accumulating multiple mass signals from a given polynucleotide. Multiple mass signals of polynucleotide products of the sequencing reactions of multiple fragments can be accumulated with a multichannel analyzer where the mass data for each polynucleotide is stored with its channel as the mass spectrometer is timeshared between the polynucleotides. Reactions to sequence different fragments can be performed in parallel to increase the sequencing rate by a factor of the number of fragments being sequenced simultaneously.
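
The channel-wise accumulation described above might be organized as in the sketch below. The class name, the channel identifiers, and the dictionary form of a spectrum are illustrative assumptions and do not describe any particular multichannel analyzer.

```python
from collections import defaultdict


class MultichannelAccumulator:
    """Accumulate repeated mass spectra separately for each channel.

    A channel stands for one polynucleotide; scans for different
    channels may arrive interleaved as the spectrometer is timeshared.
    """

    def __init__(self):
        self._sums = defaultdict(lambda: defaultdict(float))
        self._scans = defaultdict(int)

    def add_scan(self, channel, spectrum):
        """Add one spectrum (dict mass -> intensity) to a channel."""
        for mass, intensity in spectrum.items():
            self._sums[channel][mass] += intensity
        self._scans[channel] += 1

    def mean_spectrum(self, channel):
        """Return the averaged spectrum accumulated for a channel."""
        if channel not in self._scans:
            raise KeyError(channel)
        n = self._scans[channel]
        return {mass: total / n for mass, total in self._sums[channel].items()}


acc = MultichannelAccumulator()
acc.add_scan("fragment1/band3", {30: 1.0, 31: 3.0})
acc.add_scan("fragment2/band7", {30: 5.0})
acc.add_scan("fragment1/band3", {30: 3.0, 31: 1.0})
averaged = acc.mean_spectrum("fragment1/band3")
```

Averaging repeated scans leaves the signal unchanged while uncorrelated noise partially cancels, which is the basis of the signal to noise improvement.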

For multiple fragments sequenced in parallel, the sequencing rate is limited by the time required for the common mass spectrometer to obtain a mass spectrum over the entire mass range. Such a mass spectrum can be obtained for subfemtomole amounts of a polynucleotide by using a spectrometer of the design described in Nature, Vol 310, pp. 105-111 (1984) and developed by Hunkapillar, et al.

Gas samples are injected into the ion source of the mass spectrometer. Ions derived from the sample are accelerated, separated according to mass/charge ratio by the fixed electric and magnetic fields of the mass spectrometer, and focused onto an integrating detector which uses electro-optical methods to generate and amplify a computer compatible signal representative of the total charge accumulated in each ion beam. The signals generated by several thousand discrete detector elements placed along the ion focal plane are fed into a real-time array processor. The partially processed data are stored on a hard disk for later access by the computer which calculates the base composition and terminal base identity of each sample.

The mass spectrometer simultaneously monitors all ion beams over the possible mass range with sufficient signal to noise ratio to detect the arrival of single ions at the focal plane. An exemplary mass spectrometer of this type is that developed by Hunkapillar et al. described in the above reference and shown schematically in FIG. 9. Its electro-optical ion detector array has 25,0000 electron-multiplier detector units per centimeter along the focal plane. Ions impinging on one of the detector elements produce an electron pulse which is converted to a photon pulse from a phosphor screen (10⁶ total amplification). By means of fiber optics, an image of the focal plane is obtained on an array of 5,120 photodiodes, corresponding to a mass range of 25-500 atomic mass units with 0.1 AMU discrimination. As the accumulated charge across a capacitor in each photodiode is a measure of the number of ions reaching the corresponding part of the focal plane, the entire mass spectrum can be integrated over times as short as milliseconds with sensitivity limits below femtomoles. For a total scan time of 10 milliseconds, 100 base pairs per second can be sequenced where multiple fragments are sequenced in parallel.
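
The figures quoted above can be checked with a short calculation. The sketch assumes, as the quoted rate implies, that each complete spectrum characterizes one polynucleotide and that each polynucleotide contributes one base to the sequence; the function names are illustrative only.

```python
def amu_per_photodiode(mass_min_amu, mass_max_amu, n_photodiodes):
    """Width of the mass range imaged onto a single photodiode."""
    return (mass_max_amu - mass_min_amu) / n_photodiodes


def spectra_per_second(scan_time_ms):
    """Number of complete mass spectra acquired each second."""
    return 1000.0 / scan_time_ms


# 5,120 photodiodes across 25-500 AMU: a little under 0.1 AMU per diode.
resolution = amu_per_photodiode(25, 500, 5120)

# A 10 millisecond scan gives 100 spectra, hence about 100 bases, per second.
rate = spectra_per_second(10)
```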

Fourier transform ion cyclotron resonance spectroscopy represents an additional method of simultaneously quantifying the mass range of interest. This instrument is described in the following references: Comisarow, M. B., and Marshall, A. G., Chem. Phys. Lett., 25, (1974) p. 282. Comisarow, M. B., Adv. Mass Spec., 7, (1978) p. 1042.

INDUSTRIAL APPLICABILITY

Sequence information determined via the rapid sequencing method will lead to the development of new pharmaceuticals, diagnostic tools, basic science information, and new cells and organisms which can produce desired chemicals, foods, energy, antibiotics, and therapeutic proteins such as insulin and plasminogen activating factor. Three general procedures in which the ability to rapidly sequence DNA can be used to obtain a desired gene product are as follows:

General Procedure I

1) Find a cell that naturally synthesizes the desired product.

2) Isolate the cell and raise it in tissue culture.

3) Irradiate a portion of the stock.

4) Select mutants which are normal except that they no longer produce the active product.

5) Sequence the non-irradiated stock and the mutants.

6) Identify the region of mutation for each mutant.

7) Isolate the identified gene from the non-irradiated stock by using restriction enzymes and electrophoresis where the restriction digest products are predicted from the known sequence.

8) Clone in an appropriate expression system.

9) Harvest the desired protein.

General Procedure II

1) Determine the sequence of a large fragment of DNA known to contain the gene of interest.

2) Produce separate, different digests of the fragment using restriction enzymes which cut the DNA in a predictable fashion based on the elucidated restriction sites in the fragment.

3) Clone each digest separately into a system which is capable of expressing the gene product.

4) Determine whether the gene product is expressed for each separate set of clones by assaying for the same product. The product will not be expressed for those digests which cut in any interior region of the gene.

5) By comparing activity vs. nonactivity to the predicted restriction digest sites, the location of the gene in the fragment can be determined.

6) Isolate the gene from the original fragment by using restriction endonucleases and electrophoresis, where the products of the digest are predicted from the known sequence.

7) Clone the gene in an appropriate expression system.

8) Harvest the desired protein.
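
The comparison of activity with predicted cut sites in General Procedure II can be sketched as a small interval search. The representation of a digest as a list of cut positions plus an expression flag, and the function name, are illustrative assumptions; the sketch only narrows the gene to candidate regions of the fragment.

```python
def candidate_regions(fragment_length, digests):
    """Return regions of the fragment that may contain the whole gene.

    fragment_length -- length of the sequenced fragment in nucleotides
    digests         -- list of (cut_sites, expressed) pairs, where
                       cut_sites are predicted cut positions and
                       expressed is True if the product was detected
    A digest that still expresses the product cannot cut inside the
    gene, so the gene lies between two neighboring tolerated cuts.
    A digest that abolishes expression must cut inside the gene.
    """
    tolerated = {0, fragment_length}
    for cut_sites, expressed in digests:
        if expressed:
            tolerated.update(cut_sites)
    boundaries = sorted(tolerated)
    disruptive = [cuts for cuts, expressed in digests if not expressed]
    regions = []
    for start, end in zip(boundaries, boundaries[1:]):
        if all(any(start < c < end for c in cuts) for cuts in disruptive):
            regions.append((start, end))
    return regions


digests = [
    ([100, 350], True),
    ([750, 900], True),
    ([500], False),
    ([200, 600], False),
]
regions = candidate_regions(1000, digests)
```

With these hypothetical digests the only region consistent with every assay is the one between positions 350 and 750.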

General Procedure III

The protein of interest can be obtained in large, pure quantities by elucidating the nucleotide sequence. From the sequence, together with the genetic code (the sequences of three nucleotides in DNA or RNA that specify amino acids when translated into polypeptide chains), a polypeptide can be synthesized. Then a monoclonal antibody can be made to this polypeptide and the protein of interest can be isolated by affinity chromatography using the monoclonal antibody.
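
Deriving the polypeptide from the elucidated sequence with the genetic code can be sketched as follows. The sketch assumes the standard genetic code, a coding strand written 5' to 3', and a reading frame that starts at the first nucleotide; the function name is illustrative.

```python
from itertools import product

# Standard genetic code, codons enumerated with bases in the order T, C, A, G.
_BASES = "TCAG"
_AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(_BASES, repeat=3), _AMINO_ACIDS)
}


def translate(coding_dna):
    """Translate a coding-strand DNA sequence into one-letter amino acids.

    Translation starts at the first nucleotide and stops at the first
    stop codon or when fewer than three nucleotides remain.
    """
    sequence = coding_dna.upper().replace("U", "T")  # accept RNA as well
    peptide = []
    for i in range(0, len(sequence) - 2, 3):
        amino_acid = CODON_TABLE[sequence[i:i + 3]]
        if amino_acid == "*":  # stop codon
            break
        peptide.append(amino_acid)
    return "".join(peptide)


peptide = translate("ATGGCCTTTTAA")
```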

Products

Enzyme systems can be cloned which can be used to synthesize desired organic molecules. This represents a method of chemical synthesis that does not rely on fossil fuels and occurs under conditions of low pressure and temperature. Also, the products are reasonably pure because side reactions are minimal.

The rapid sequencing method could be instrumental in developing organisms that use waste products as carbon sources to produce fuels such as long chain hydrocarbons, methane, and butane.

Further, in the field of food production, genetic information obtained by this sequencing method would be instrumental in developing nitrogen fixation in cereal grains, prolific super hybrids, and plants with herbicide, insecticide, and drought resistance.

Applications in medicine include discovering genes, the products of which are responsible for controlling developmental processes or for revascularization. Cloned products of these genes may be used to regenerate diseased organs. Also, diagnostic markers can be developed such as hybridization probes for genes associated with birth defects or oncogene rearrangements which represent a diagnostic tool for cancer. Furthermore, by establishing the capability to isolate specific gene products, the method would be invaluable in developing gene products such as insulin and plasminogen activating factor which are useful in the treatment of diseases as well as vaccines for the prevention of disease. An ELISA assay for oncogene products that can be used to diagnose cancer can be developed by sequencing the oncogene. From the sequence together with the genetic code, a polypeptide of a portion of the gene can be synthesized. A monoclonal antibody against the polypeptide can be made and this antibody can be used in an assay for the detection of oncogene products in patients suspected of having cancer.

The potential uses in basic science research are enormous. A small sample of the possibilities follows. By using a procedure similar to that described above as General Procedure I, the elucidation of genes responsible for developmental and for cellular processes is made possible. The mutants in these cases would lack the normal developmental or cellular processes. By directly sequencing the DNA fragment of interest, information regarding heritability can be determined as well as inter and intrachromosomal recombination and other genetic events such as translocation and intra and interchromosomal rearrangements. Also, important discoveries may be made such as the discovery of genes containing long terminal repeat sequences which could be possible heritable or nonheritable genes responsible for disease. Also, chromosomal break points responsible for birth defects and cancer as well as translocation events of oncogenes which are responsible for the development of cancer can be elucidated via this rapid sequencing method. The information in turn will lead to the development of new treatments and pharmaceuticals.

Equivalents

Those skilled in the art will recognize, or be able to ascertain using no more than routine experimentation, many equivalents to the specific embodiments of the invention described herein. Such equivalents are intended to be encompassed by the following claims.

I claim:
 1. A method of sequencing DNA comprising the steps of: a) preparing from segments of a DNA strand to be sequenced, families of polynucleotides, each family including all polynucleotides, complementary to at least a portion of the DNA segment and at least a portion of the 3' flanking DNA segment of the DNA strand to be sequenced, of the formula:

    K_(n') . . . K₄ K₃ K₂ K₁ X₁ X₂ X₃ X₄ . . . X_(n)

ranging in length from K₁ X₁ to K_(n') . . . X_(n), wherein K₁ K₂ K₃ K₄ . . . K_(n') represents the nucleotides 5' to an internal reference point, the reference point defined as the dividing line between K₁ and X₁; wherein X₁ X₂ X₃ X₄ . . . X_(n) represents the nucleotides 3' to the internal reference point; wherein n and n' are integers and n+n', the number of nucleotides in a polynucleotide, is less than or equal to the number of nucleotides in a polynucleotide of length within the analyzable limit of the method for determining base composition and identity of the 3' terminal nucleotide of a polynucleotide; and wherein each polynucleotide in the family conforms to the criterion that if the polynucleotide contains X_(n) it also contains X_(n-1), X_(n-2) . . . X₁; or the criteria that if the polynucleotide contains K_(n') it also contains K_(n'-1), K_(n'-2) . . . K₁ and if the polynucleotide contains X_(n) then it also contains X_(n-1), X_(n-2) . . . X₁; or the criteria that if any two polynucleotides have the same base composition, then they have different terminal bases and if any polynucleotide contains X_(n), then it also contains X_(n-1), X_(n-2) . . . X₁; b) determining the base composition and the identity of the 3' terminal base of each polynucleotide of each family; c) determining the base sequence of the longest polynucleotide in each family from the determined base composition and identity of the 3' terminal base of each polynucleotide in the family and the derived change in base composition and terminal base between polynucleotides in each family; and d) determining the base sequence of the entire DNA strand to be sequenced based upon the overlapping sequences of the longest polynucleotides in each family.
 2. The method of claim 1 wherein the base sequence of the longest polynucleotide of each family is determined by the Matrix Method of Analysis of the base composition of each polynucleotide in the family and the identity of the 3' terminal base of each polynucleotide.
 3. The method of claim 1, wherein the base sequence of the longest polynucleotide in each set is determined by: a) setting up a matrix consisting of 1/2M+1 columns and 1/2M rows where M is the number of nucleotides in the longest polynucleotide of the set; b) assigning the longest polynucleotide a coordinate position in the matrix of column 1, row 1; c) assigning polynucleotides which are successively one nucleotide shorter on the 5' end to each column position and polynucleotides which are successively one nucleotide shorter on the 3' end to each row position; d) determining all paths through the matrix from position 1,1 to position 1/2M+1, 1/2M which are consistent with the base composition and the 3' terminal base of the polynucleotide assigned to each position in the matrix and with the change in base composition and 3' terminal base between polynucleotides; and e) from position 1/2M+1, 1/2M determining the path back to position 1,1 which permits the assignment of specific bases at each step to either the 5' or 3' end of a polynucleotide, consistent with the compositional and terminal base data, to arrive at the sequence of the longest polynucleotide.
 4. The method of claim 3 wherein the K₁ K₂ K₃ K₄ . . . K_(n') is guessed and steps d) and e) are performed reiteratively until a sequence can be assigned without contradiction.
 5. A method of sequencing DNA comprising the steps of: a) cleaving the DNA to be sequenced to produce fragments of about 20 to about 400 nucleotides in length; b) separating and isolating the DNA fragments according to size; c) separating and isolating the individual strands of each fragment; d) preparing RNA/DNA hybrid polynucleotides by: i) making an RNA transcript(s) of at least a portion of the fragment strand; ii) isolating the RNA transcript(s); iii) extending the RNA transcript(s) with deoxyribonucleotides, using the DNA to be sequenced as template for the extension, and terminating the extension randomly to produce a set of RNA/DNA polynucleotides ranging in length up to about 400 nucleotides; e) separating and isolating each of the RNA/DNA hybrid polynucleotides in each set; f) determining the base composition and the identity of the 3' terminal base of each RNA/DNA polynucleotide of each set; g) randomizing the 5' end of at least one RNA/DNA polynucleotide of length greater than one half the length of the longest RNA/DNA polynucleotide of each set and at least one of the smallest RNA/DNA hybrid polynucleotides in the set, to produce RNA/DNA polynucleotides having an RNA portion containing from one ribonucleotide to the number of ribonucleotides in the original transcript; h) separating and isolating the 5' randomized RNA/DNA polynucleotides; i) determining the base composition and terminal base of each 5' randomized RNA/DNA hybrid molecule; j) determining the base sequence of the longest RNA/DNA polynucleotide in each set from the determined base composition and identity of the 3' terminal base of the RNA/DNA hybrid polynucleotides; and k) determining the base sequence of the entire DNA to be sequenced from the overlapping sequences of the longest polynucleotides in each family.
 6. The method of claim 5, wherein the cleaving step comprises digesting the DNA to be sequenced with a restriction endonuclease.
 7. The method of claim 5, wherein the DNA fragments are separated by electrophoresis on an agarose or polyacrylamide gel.
 8. The method of claim 5, wherein a) the RNA transcript(s) is prepared with NMR-labeled ribonucleotides, each type of ribonucleotide labeled with an atom so as to generate a distinctive chemical shift in resonance frequency detectable by NMR spectrometry; b) the RNA transcript(s) are extended in DNA with NMR-labeled deoxyribonucleotides, each type of deoxyribonucleotide labeled with an atom so as to generate a distinctive chemical shift in resonance frequency detectable by NMR spectrometry; c) the extension reaction is terminated by including in the extension reaction mixture NMR-labeled dideoxyribonucleotides, each type of dideoxyribonucleotide labeled with an atom so as to generate a distinctive chemical shift in resonance frequency detectable by NMR spectrometry; d) the base composition and identity of the 3' terminal base of the polynucleotides is determined by: i) scanning with an NMR spectrometer each polynucleotide to determine the chemical shift of the resonance frequency distinctive of each base and the intensity of the signal at the chemical shift; and ii) normalizing the intensity of each signal with the intensity of the signal corresponding to the 3' terminal base to obtain the number of each type of base in the polynucleotide.
 9. The method of claim 8 wherein multiple samples of polynucleotides are scanned by NMR spectrometer simultaneously to determine the base composition and identity of the 3' terminal base by: a) placing samples of each polynucleotide to be scanned in a predetermined spatial configuration in a magnetic field whose strength varies such that it is different at each sample location and is known at that location; b) determining the free induction decay signal of all samples as a function of time; c) determining the frequency spectrum of all samples by transformation of the free induction decay signal from the time to the frequency domain; d) determining the component frequency spectrum of the individual sample from the known location of the sample and the magnetic field at that location; e) for each polynucleotide sample, determining from the frequency spectrum the chemical shift of the resonance frequency distinctive of each base and the intensity of the signal at the chemical shift; and f) normalizing the intensity of each signal with the intensity of the signal corresponding to the 3' terminal base to obtain the number of each type of base in the polynucleotide.
 10. The method of claim 5, wherein each type of ribonucleotide, deoxyribonucleotide and dideoxyribonucleotide is labeled with N¹⁵, C¹³, H¹, P³¹ or O¹⁷.
 11. The method of claim 5, wherein the 5' end of the selected RNA/DNA polynucleotides in step g is randomized by treatment with a partially processive exoribonuclease that selectively degrades RNA.
 12. The method of claim 5, wherein the 5' end of the selected RNA/DNA polynucleotides in step g is randomized by hydrolyzing the polynucleotide with mild base and subsequently degrading cleaved RNA segments with an exoribonuclease that requires a 3' hydroxyl.
 13. The method of claim 5, wherein the nucleotide sequence of the longest RNA/DNA polynucleotide of each set is determined by the Matrix Method of Analysis of the base composition of each polynucleotide in the family and the identity of the 3' terminal base of each polynucleotide.
 14. A method of sequencing DNA, comprising the steps of: a) isolating the DNA to be sequenced; b) preparing 3' randomly ended RNA transcripts of the DNA in multiple reaction mixtures, all transcripts initiating from the 3' end of the DNA to be sequenced, such that for any reaction n the succeeding reaction n+1 results in RNA transcripts which are on average longer than those in reaction n; c) isolating the RNA transcripts from each reaction; d) extending the transcripts with deoxyribonucleotides using the DNA to be sequenced as template and terminating the extension reaction randomly to produce a set of RNA/DNA hybrid polynucleotides; e) degrading the RNA portion of the polynucleotides; f) separating the DNA molecules according to size; g) determining the base composition and the identity of the 3' terminal base of the set of DNA molecules generated from the transcripts of each reaction; h) determining the sequence of the longest DNA molecule in each set from the determined base composition and identity of the 3' terminal base of the DNA molecules; and i) determining the sequence of the entire DNA to be sequenced from the region of overlap of the longest DNA molecule of each set.
 15. The method of claim 14, wherein a) the RNA transcripts produced in each reaction mixture are extended in DNA with NMR-labeled deoxyribonucleotides, each type of deoxyribonucleotide labeled with an atom so as to generate a distinctive chemical shift in resonance frequency detectable by NMR spectrometry; b) the extension reaction is terminated by including in the extension reaction mixture NMR-labeled dideoxyribonucleotides, each type of dideoxyribonucleotide labeled with an atom so as to generate a distinctive chemical shift in resonance frequency detectable by NMR spectrometry; c) the base composition and identity of the 3' terminal base of the DNA molecules is determined by: i) scanning with an NMR spectrometer each DNA molecule to determine the chemical shift of the resonance frequency distinctive of each base and the intensity of the signal at the chemical shift; and ii) normalizing the intensity of each signal with the intensity of the signal corresponding to the 3' terminal base to obtain the number of each type of base in the polynucleotide.
 16. The method of claim 14, wherein the nucleotide sequence of the longest DNA molecule is determined by the Matrix Method of Analysis of the base composition and terminal base data.
 17. The method of claim 5, wherein:a) the RNAtranscript is prepared with mass labeled ribonucleotides, each type ofribonucleotide labeled at the 5th position of the ribose with an isotopeor isotopes of carbon, hydrogen or oxygen such that labeled formaldehydeof a unique mass corresponding to each base can be released from thisposition; b) the RNA transcripts are extended in DNA with mass labeleddeoxyribonucleotides, each type of deoxyribonucleotide labeled at the5th position of the deoxyribose with an isotope or isotopes of carbon,hydrogen or oxygen such that labeled formaldehyde of a unique mass canbe released from this position; c) the extension reaction is terminatedby including in the extension reaction a mixture of mass labeleddideoxyribonucleotides labeled at the 5th position of the dideoxyribosewith an isotope or isotopes of carbon, hydrogen, or oxygen such thatlabeled formaldehyde of a unique mass corresponding to each terminalbase can be released from this position; d) the base composition and theidentity of the 3' terminal base of the RNA/DNA hybrid polynucleotide isdetermined by:i) degrading the polynucleotide to produce 3' nucleotidesor nucleosides; ii) hydrolyzing the B--N glycosyl bond between the baseand the pentose; iii) reacting the pentose with periodic acid toliberate formaldehyde from the 5th position of the pentose; iv)determining the relative abundance of the formaldehyde molecules ofdifferent mass with a mass spectrometer; and v) normalizing theintensity of the mass signal of each formaldehyde molecule correspondingto a specific base with the mass signal of the formaldehyde moleculecorresponding to the 3' terminal base to obtain the number of each typeof base in the polynucleotide.
 18. The method of claim 17, wherein the labeled formaldehyde is oxidized to labeled carbon dioxide which is analyzed by mass spectrometry.
 19. The method of claim 17, wherein: a) the β-N glycosyl bonds of purines and pyrimidines are hydrolyzed in separate reactions involving separate aliquots of the polynucleotides; b) the selectively hydrolyzed aliquots are oxidized separately with periodic acid to release formaldehyde; c) the released formaldehyde is analyzed by mass spectrometry to determine the base composition and terminal base of purines and pyrimidines separately where the signals corresponding to the purines in one mass spectrum and signals corresponding to the pyrimidines in another mass spectrum are normalized by the signal corresponding to the terminal nucleotide which is present in only one of the spectra.
 20. The method of claim 19, wherein the β-N bond of purines or pyrimidines is selectively hydrolyzed by the A-G and T-C depurinating or depyrimidinating reactions of Maxam-Gilbert.
 21. The method of claim 5, wherein the base composition and the identity of the 3' terminal base of each RNA/DNA hybrid is determined by: a) hydrolyzing the polynucleotides to nucleotides or nucleosides; b) separating the resulting nucleotides or nucleosides by chromatography; c) quantifying the migration bands corresponding to each base and terminal base by measuring signals generated by each separated band and correcting signals which are nonlinear or are base dependent by a calibration factor; and d) determining the base composition by normalizing the corrected signals corresponding to each base with the corrected signals corresponding to the terminal base.
 22. The method of claim 5, wherein the terminal dideoxynucleotides contain base analogues and wherein the base composition and the identity of the 3' terminal base of each RNA/DNA hybrid is determined by: a) hydrolyzing the polynucleotides to free bases; b) separating the free bases by chromatography; c) quantifying the migration bands corresponding to each base and terminal base by measuring signals generated by each separated band and correcting signals which are nonlinear or are base dependent by a calibration factor; and d) determining the base composition by normalizing the corrected signals corresponding to each base with the corrected signals corresponding to the terminal base.
 23. The method of claim 21 or 22, wherein separation in step b is done by thin layer chromatography, ion exchange chromatography, reverse phase chromatography or high performance liquid chromatography (HPLC).
 24. The method of claim 21 or 22 wherein migration bands are quantified by absorption, fluorescence, reflectance, conductance, or scintillation.
 25. The method of claim 5, wherein the base composition and the identity of the 3' terminal base of each RNA/DNA hybrid is determined by: a) hydrolyzing the polynucleotides to nucleotides or nucleosides or fragmenting the polynucleotides to nucleotides or nucleosides by an electron beam; b) obtaining the mass spectrum of the nucleotides or nucleosides by mass spectrometry; c) correcting the mass spectrometric signals corresponding to each base and the terminal base which are nonlinear or are base dependent by a calibration factor; and d) determining the base composition by normalizing the corrected signal corresponding to each base with the corrected signal corresponding to the terminal base.
 26. The method of claim 22, wherein the terminal dideoxynucleotides contain base analogs and wherein the base composition and the identity of the 3' terminal base of each RNA/DNA hybrid is determined by: a) hydrolyzing the polynucleotides to free bases; b) obtaining the mass spectrum of the free bases by mass spectrometry; c) correcting the mass signal corresponding to each base and terminal base which are nonlinear or base dependent by a calibration factor; d) determining the base compositions by normalizing the corrected mass signal corresponding to each base with the corrected mass signal corresponding to the terminal base.
 27. A method of claim 5, wherein: a) a single RNA transcript of the fragment strand is isolated and the set of RNA/DNA polynucleotides is prepared from this transcript; b) the sequence is assigned in the 5' to 3' direction by the change in nucleotide composition of a polynucleotide of the set compared to a polynucleotide of the set of length greater by one nucleotide, where the composition and terminal base data of the polynucleotides of the DNA extension reaction are used; c) an RNA/DNA hybrid is randomized as in step g and the sequence is assigned in 3' to 5' direction by the change in nucleotide composition of a polynucleotide of the set compared to a polynucleotide of the set of length greater by one nucleotide where the composition and terminal base data of the polynucleotides of the 5' randomization reaction are used.
 28. The method of claim 5, wherein each type of ribonucleotide, deoxyribonucleotide, and dideoxyribonucleotide is labeled at the 5th position of the pentose with one or more of C¹², C¹³, C¹⁴, H¹, H², H³, O¹⁶, O¹⁷, and O¹⁸.
 29. The method of claim 14, wherein: a) the RNA transcripts produced in each reaction mixture are extended in DNA with mass labeled deoxyribonucleotides, each type of deoxyribonucleotide labeled at the 5th position of the deoxyribose with an isotope or isotopes of carbon, hydrogen, or oxygen such that labeled formaldehyde of a unique mass corresponding to each base can be released from this position; b) the extension reaction is terminated by including in the extension reaction a mixture of mass labeled dideoxynucleotides, each type of dideoxynucleotide labeled at the 5th position of the dideoxyribose with an isotope or isotopes of carbon, hydrogen or oxygen such that labeled formaldehyde of a unique mass corresponding to each terminal base can be liberated from this position; c) the base composition and identity of the terminal base of the DNA molecule is determined by: i) degrading the polynucleotide to 3' nucleotides or nucleosides; ii) hydrolyzing the β-N glycosyl bond between the base and the pentose; iii) reacting the pentose with periodic acid to liberate formaldehyde from the 5th position; iv) determining the relative abundance of formaldehyde molecules of different mass with a mass spectrometer; and v) normalizing the intensity of the mass signal of each formaldehyde molecule corresponding to a specific base with the mass signal of the formaldehyde molecule corresponding to the 3' terminal base to obtain the number of each type of base in the polynucleotide.
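
By way of illustration only, the signal correction and normalization recited in claims 15, 17, 21, 25 and 29, and the one-nucleotide composition comparison of claim 27(b), can be sketched in Python as follows. The function names (calibrate, base_counts, assign_5_to_3), the dictionary representation of the signals, the multiplicative form of the calibration factor, and the rounding of normalized ratios to whole numbers are assumptions made for this sketch and do not appear in the claims. The sketch further assumes that every polynucleotide carries exactly one distinguishably labeled terminal base, and that the family contains one member of each consecutive length.

```python
from collections import Counter


def calibrate(raw_signals, factors):
    """Correct raw signals with base-specific calibration factors.

    `factors` is a hypothetical mapping base -> multiplier; a base with
    no entry is left unchanged (factor 1.0).
    """
    return {base: value * factors.get(base, 1.0)
            for base, value in raw_signals.items()}


def base_counts(signals, terminal_signal):
    """Normalize each base signal by the terminal-base signal.

    One terminal base per molecule is assumed, so the terminal signal
    stands for a count of one. Ratios are rounded to whole numbers
    (an assumption of this sketch).
    """
    if terminal_signal <= 0:
        raise ValueError("terminal-base signal must be positive")
    return {base: round(value / terminal_signal)
            for base, value in signals.items()}


def assign_5_to_3(family):
    """Assign a sequence 5' to 3' from composition and terminal-base data.

    `family` holds (internal_counts, terminal_base) pairs in any order,
    where internal_counts excludes the terminal base. Members are sorted
    by total length; each longer member must gain exactly one base, and
    that base must equal its own terminal base.
    """
    members = []
    for internal, terminal in family:
        full = Counter(internal)
        full[terminal] += 1          # add the terminal base to the composition
        members.append((sum(full.values()), full, terminal))
    if not members:
        return ""
    members.sort(key=lambda m: m[0])

    sequence = [members[0][2]]
    for (n_short, short, _), (n_long, longer, terminal) in zip(members, members[1:]):
        gained = longer - short      # Counter subtraction keeps positive counts only
        if n_long != n_short + 1 or list(gained.elements()) != [terminal]:
            raise ValueError("family members must differ by exactly one base")
        sequence.append(terminal)
    return "".join(sequence)


# Illustrative values only; none of these numbers come from the specification.
corrected = calibrate({"A": 3.0, "C": 2.0, "G": 4.4, "T": 0.0}, {"G": 0.5})
print(base_counts(corrected, terminal_signal=1.0))

family = [
    ({"G": 1, "A": 1}, "T"),
    ({}, "G"),
    ({"G": 1, "A": 1, "T": 1}, "C"),
    ({"G": 1}, "A"),
]
print(assign_5_to_3(family))  # GATC
```

The 3' to 5' assignment of claim 27(c) would mirror assign_5_to_3, using the data of the 5' randomization reaction; it is not shown here.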